Computation and Language 81
☆ Continuous Autoregressive Language Models
The efficiency of large language models (LLMs) is fundamentally limited by
their sequential, token-by-token generation process. We argue that overcoming
this bottleneck requires a new design axis for LLM scaling: increasing the
semantic bandwidth of each generative step. To this end, we introduce
Continuous Autoregressive Language Models (CALM), a paradigm shift from
discrete next-token prediction to continuous next-vector prediction. CALM uses
a high-fidelity autoencoder to compress a chunk of K tokens into a single
continuous vector, from which the original tokens can be reconstructed with
over 99.9% accuracy. This allows us to model language as a sequence of
continuous vectors instead of discrete tokens, which reduces the number of
generative steps by a factor of K. The paradigm shift necessitates a new
modeling toolkit; therefore, we develop a comprehensive likelihood-free
framework that enables robust training, evaluation, and controllable sampling
in the continuous domain. Experiments show that CALM significantly improves the
performance-compute trade-off, achieving the performance of strong discrete
baselines at a significantly lower computational cost. More importantly, these
findings establish next-vector prediction as a powerful and scalable pathway
towards ultra-efficient language models. Code:
https://github.com/shaochenze/calm. Project:
https://shaochenze.github.io/blog/2025/CALM.
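To make the next-vector idea concrete, here is a minimal sketch of a chunk autoencoder in the spirit of the abstract (this is not the released CALM code; the architecture, dimensions, and K are illustrative assumptions): it compresses a chunk of K token ids into one continuous vector and reconstructs the tokens from that vector.

```python
# Illustrative sketch only: compress K tokens into one continuous vector and
# reconstruct them. Hyperparameters (K, vocab_size, dims) are placeholders.
import torch
import torch.nn as nn

class ChunkAutoencoder(nn.Module):
    def __init__(self, vocab_size=32000, K=4, d_tok=256, d_vec=512):
        super().__init__()
        self.K, self.vocab_size = K, vocab_size
        self.embed = nn.Embedding(vocab_size, d_tok)
        self.encoder = nn.Sequential(
            nn.Linear(K * d_tok, d_vec), nn.GELU(), nn.Linear(d_vec, d_vec))
        self.decoder = nn.Sequential(
            nn.Linear(d_vec, d_vec), nn.GELU(), nn.Linear(d_vec, K * vocab_size))

    def forward(self, token_ids):             # token_ids: (batch, K)
        x = self.embed(token_ids).flatten(1)  # (batch, K * d_tok)
        z = self.encoder(x)                   # one continuous vector per chunk
        logits = self.decoder(z).view(-1, self.K, self.vocab_size)
        return z, logits

model = ChunkAutoencoder()
chunk = torch.randint(0, 32000, (2, 4))       # two chunks of K=4 token ids
z, logits = model(chunk)
loss = nn.functional.cross_entropy(logits.flatten(0, 1), chunk.flatten())
print(z.shape, loss.item())                   # torch.Size([2, 512]) ...
```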
☆ Culture Cartography: Mapping the Landscape of Cultural Knowledge EMNLP 2025
To serve global users safely and productively, LLMs need culture-specific
knowledge that might not be learned during pre-training. How do we find such
knowledge that is (1) salient to in-group users, but (2) unknown to LLMs? The
most common solutions are single-initiative: either researchers define
challenging questions that users passively answer (traditional annotation), or
users actively produce data that researchers structure as benchmarks (knowledge
extraction). The process would benefit from mixed-initiative collaboration,
where users guide the process to meaningfully reflect their cultures, and LLMs
steer the process towards more challenging questions that meet the researcher's
goals. We propose a mixed-initiative methodology called CultureCartography.
Here, an LLM initializes annotation with questions for which it has
low-confidence answers, making explicit both its prior knowledge and the gaps
therein. This allows a human respondent to fill these gaps and steer the model
towards salient topics through direct edits. We implement this methodology as a
tool called CultureExplorer. Compared to a baseline where humans answer
LLM-proposed questions, we find that CultureExplorer more effectively produces
knowledge that leading models like DeepSeek R1 and GPT-4o are missing, even
with web search. Fine-tuning on this data boosts the accuracy of Llama-3.1-8B
by up to 19.2% on related culture benchmarks.
comment: EMNLP 2025
☆ SpecAttn: Speculating Sparse Attention NeurIPS 2025
Large Language Models (LLMs) face significant computational bottlenecks
during inference due to the quadratic complexity of self-attention mechanisms,
particularly as context lengths increase. We introduce SpecAttn, a novel
training-free approach that seamlessly integrates with existing speculative
decoding techniques to enable efficient sparse attention in pre-trained
transformers. Our key insight is to exploit the attention weights already
computed by the draft model during speculative decoding to identify important
tokens for the target model, eliminating redundant computation while
maintaining output quality. SpecAttn employs three core techniques: KL
divergence-based layer alignment between draft and target models, a
GPU-optimized sorting-free algorithm for top-p token selection from draft
attention patterns, and dynamic key-value cache pruning guided by these
predictions. By leveraging the computational work already performed in standard
speculative decoding pipelines, SpecAttn achieves over 75% reduction in
key-value cache accesses with a mere 15.29% increase in perplexity on the PG-19
dataset, significantly outperforming existing sparse attention methods. Our
approach demonstrates that speculative execution can be enhanced to provide
approximate verification without significant performance degradation.
comment: Accepted to NeurIPS 2025 Workshop on Structured Probabilistic
Inference & Generative Modeling
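As a rough illustration of the top-p selection and cache-pruning steps, the sketch below keeps only the context tokens that carry most of the draft model's attention mass and uses that mask to prune KV-cache entries. It is an assumption-laden toy: shapes and the threshold are placeholders, and it uses ordinary sorting, whereas the paper describes a GPU-optimized sorting-free top-p selection.

```python
# Toy sketch: select context tokens covering attention mass p under the draft
# model's attention, then keep only those entries in the target KV cache.
import torch

def top_p_token_mask(draft_attn, p=0.9):
    """draft_attn: (num_ctx,) attention weights averaged over heads/queries."""
    weights, order = torch.sort(draft_attn, descending=True)
    keep = torch.cumsum(weights, dim=0) <= p
    keep[0] = True                      # always keep the highest-weight token
    mask = torch.zeros_like(draft_attn, dtype=torch.bool)
    mask[order[keep]] = True
    return mask

draft_attn = torch.softmax(torch.randn(128), dim=0)  # toy draft attention over 128 tokens
mask = top_p_token_mask(draft_attn, p=0.9)
print(f"keeping {mask.sum().item()} of {mask.numel()} KV entries")
```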
☆ Visual Backdoor Attacks on MLLM Embodied Decision Making via Contrastive Trigger Learning
Qiusi Zhan, Hyeonjeong Ha, Rui Yang, Sirui Xu, Hanyang Chen, Liang-Yan Gui, Yu-Xiong Wang, Huan Zhang, Heng Ji, Daniel Kang
Multimodal large language models (MLLMs) have advanced embodied agents by
enabling direct perception, reasoning, and planning task-oriented actions from
visual inputs. However, such vision-driven embodied agents open a new attack
surface: visual backdoor attacks, where the agent behaves normally until a
visual trigger appears in the scene, then persistently executes an
attacker-specified multi-step policy. We introduce BEAT, the first framework to
inject such visual backdoors into MLLM-based embodied agents using objects in
the environments as triggers. Unlike textual triggers, object triggers exhibit
wide variation across viewpoints and lighting, making them difficult to implant
reliably. BEAT addresses this challenge by (1) constructing a training set that
spans diverse scenes, tasks, and trigger placements to expose agents to trigger
variability, and (2) introducing a two-stage training scheme that first applies
supervised fine-tuning (SFT) and then our novel Contrastive Trigger Learning
(CTL). CTL formulates trigger discrimination as preference learning between
trigger-present and trigger-free inputs, explicitly sharpening the decision
boundaries to ensure precise backdoor activation. Across various embodied agent
benchmarks and MLLMs, BEAT achieves attack success rates up to 80%, while
maintaining strong benign task performance, and generalizes reliably to
out-of-distribution trigger placements. Notably, compared to naive SFT, CTL
boosts backdoor activation accuracy up to 39% under limited backdoor data.
These findings expose a critical yet unexplored security risk in MLLM-based
embodied agents, underscoring the need for robust defenses before real-world
deployment.
☆ Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum
The prevailing video retrieval paradigm is structurally misaligned: narrow benchmarks incentivize correspondingly limited data and single-task training, and universal capability is suppressed by the absence of a diagnostic evaluation that defines and demands multi-dimensional generalization. To break this cycle, we introduce a framework built on the
co-design of evaluation, data, and modeling. First, we establish the Universal
Video Retrieval Benchmark (UVRB), a suite of 16 datasets designed not only to
measure performance but also to diagnose critical capability gaps across tasks
and domains. Second, guided by UVRB's diagnostics, we introduce a scalable
synthesis workflow that generates 1.55 million high-quality pairs to populate
the semantic space required for universality. Finally, we devise the Modality
Pyramid, a curriculum that trains our General Video Embedder (GVE) by
explicitly leveraging the latent interconnections within our diverse data.
Extensive experiments show GVE achieves state-of-the-art zero-shot
generalization on UVRB. In particular, our analysis reveals that popular
benchmarks are poor predictors of general ability and that partially relevant
retrieval is a dominant but overlooked scenario. Overall, our co-designed
framework provides a practical path to escape the limited scope and advance
toward truly universal video retrieval.
☆ MARAG-R1: Beyond Single Retriever via Reinforcement-Learned Multi-Tool Agentic Retrieval
Large Language Models (LLMs) excel at reasoning and generation but are
inherently limited by static pretraining data, resulting in factual
inaccuracies and weak adaptability to new information. Retrieval-Augmented
Generation (RAG) addresses this issue by grounding LLMs in external knowledge.
However, the effectiveness of RAG critically depends on whether the model can
adequately access relevant information. Existing RAG systems rely on a single
retriever with fixed top-k selection, restricting access to a narrow and static
subset of the corpus. As a result, this single-retriever paradigm has become
the primary bottleneck for comprehensive external information acquisition,
especially in tasks requiring corpus-level reasoning. To overcome this
limitation, we propose MARAG-R1, a reinforcement-learned multi-tool RAG
framework that enables LLMs to dynamically coordinate multiple retrieval
mechanisms for broader and more precise information access. MARAG-R1 equips the
model with four retrieval tools -- semantic search, keyword search, filtering,
and aggregation -- and learns both how and when to use them through a two-stage
training process: supervised fine-tuning followed by reinforcement learning.
This design allows the model to interleave reasoning and retrieval,
progressively gathering sufficient evidence for corpus-level synthesis.
Experiments on GlobalQA, HotpotQA, and 2WikiMultiHopQA demonstrate that
MARAG-R1 substantially outperforms strong baselines and achieves new
state-of-the-art results in corpus-level reasoning tasks.
☆ SIGMA: Search-Augmented On-Demand Knowledge Integration for Agentic Mathematical Reasoning
Solving mathematical reasoning problems requires not only accurate access to
relevant knowledge but also careful, multi-step thinking. However, current
retrieval-augmented models often rely on a single perspective, follow
inflexible search strategies, and struggle to effectively combine information
from multiple sources. We introduce SIGMA (Search-Augmented On-Demand Knowledge
Integration for AGentic Mathematical reAsoning), a unified framework that
orchestrates specialized agents to independently reason, perform targeted
searches, and synthesize findings through a moderator mechanism. Each agent
generates hypothetical passages to optimize retrieval for its analytic
perspective, ensuring knowledge integration is both context-sensitive and
computation-efficient. When evaluated on challenging benchmarks such as
MATH500, AIME, and the PhD-level science QA benchmark GPQA, SIGMA consistently outperforms
both open- and closed-source systems, achieving an absolute performance
improvement of 7.4%. Our results demonstrate that multi-agent, on-demand
knowledge integration significantly enhances both reasoning accuracy and
efficiency, offering a scalable approach for complex, knowledge-intensive
problem-solving. We will release the code upon publication.
comment: Short Paper - Under Review
☆ Data-Efficient Domain Adaptation for LLM-based MT using Contrastive Preference Optimization
LLMs often require adaptation to domain-specific requirements, a process that
can be expensive when relying solely on SFT. We present an empirical study on
applying CPO to simulate a post-editing workflow for data-efficient domain
adaptation. Our approach synthesizes preference pairs by treating the base
model's own raw output as the 'rejected' translation and the human-approved translation memory (TM)
entry as the 'chosen' one. This method provides direct feedback on the model's
current knowledge, guiding it to align with domain-specific standards.
Experiments in English-Brazilian Portuguese and English-Korean show that, by
using just 14.7k preference pairs, the model achieves performance close to that
of a model trained on 160k+ samples with SFT, demonstrating significant data
efficiency. Although we showcase its effectiveness in MT, this application of
CPO naturally generalizes to other generative tasks where a model's initial
drafts can serve as a contrastive signal against a golden reference.
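A minimal sketch of the preference-pair construction described above (the function names and the prompt template are hypothetical, not the authors' code): the base model's raw translation becomes the 'rejected' answer and the approved TM entry the 'chosen' one.

```python
# Toy sketch: build CPO preference pairs from translation-memory entries.
def build_cpo_pairs(tm_entries, generate):
    """tm_entries: list of (source_sentence, approved_translation) from the TM.
    generate: callable returning the base model's raw translation."""
    pairs = []
    for source, approved in tm_entries:
        prompt = f"Translate to Brazilian Portuguese: {source}"
        pairs.append({
            "prompt": prompt,
            "chosen": approved,           # human-approved TM entry
            "rejected": generate(prompt)  # model's own current output
        })
    return pairs

# toy usage with a stub generator
tm = [("The pump must be primed before start-up.",
       "A bomba deve ser escorvada antes da partida.")]
pairs = build_cpo_pairs(tm, generate=lambda p: "A bomba deve ser iniciada antes do começo.")
print(pairs[0]["chosen"], "|", pairs[0]["rejected"])
```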
☆ Multilingual BERT language model for medical tasks: Evaluation on domain-specific adaptation and cross-linguality
Yinghao Luo, Lang Zhou, Amrish Jhingoer, Klaske Vliegenthart Jongbloed, Carlijn Jordans, Ben Werkhoven, Tom Seinen, Erik van Mulligen, Casper Rokx, Yunlei Li
In multilingual healthcare applications, the availability of domain-specific
natural language processing (NLP) tools is limited, especially for low-resource languages. Although multilingual bidirectional encoder representations from transformers (BERT) offer a promising way to mitigate the language gap, medical NLP tasks in low-resource languages remain underexplored.
Therefore, this study investigates how further pre-training on domain-specific
corpora affects model performance on medical tasks, focusing on three
languages: Dutch, Romanian, and Spanish. For further pre-training, we conducted four experiments to create medical-domain models. These models were then fine-tuned on three downstream tasks: automated patient screening in Dutch clinical notes, and named entity recognition in Romanian and Spanish clinical notes. Results show that domain adaptation significantly enhanced task performance. Furthermore, finer domain distinctions, e.g. clinical versus general biomedical, produced varied results: the clinical domain-adapted model outperformed the more general biomedical domain-adapted model. We also observed evidence of cross-lingual transferability and conducted additional investigations into potential reasons for these performance differences. These findings highlight the feasibility of domain adaptation and cross-lingual transfer in medical NLP. In low-resource language settings, they can provide meaningful guidance for developing multilingual medical NLP systems that mitigate the lack of training data and thereby improve model performance.
☆ DialectalArabicMMLU: Benchmarking Dialectal Capabilities in Arabic and Multilingual Language Models
Malik H. Altakrori, Nizar Habash, Abdelhakim Freihat, Younes Samih, Kirill Chirkunov, Muhammed AbuOdeh, Radu Florian, Teresa Lynn, Preslav Nakov, Alham Fikri Aji
We present DialectalArabicMMLU, a new benchmark for evaluating the
performance of large language models (LLMs) across Arabic dialects. While
recently developed Arabic and multilingual benchmarks have advanced LLM
evaluation for Modern Standard Arabic (MSA), dialectal varieties remain
underrepresented despite their prevalence in everyday communication.
DialectalArabicMMLU extends the MMLU-Redux framework through manual translation
and adaptation of 3K multiple-choice question-answer pairs into five major
dialects (Syrian, Egyptian, Emirati, Saudi, and Moroccan), yielding a total of
15K QA pairs across 32 academic and professional domains (22K QA pairs when
also including English and MSA). The benchmark enables systematic assessment of
LLM reasoning and comprehension beyond MSA, supporting both task-based and
linguistic analysis. We evaluate 19 open-weight Arabic and multilingual LLMs
(1B-13B parameters) and report substantial performance variation across
dialects, revealing persistent gaps in dialectal generalization.
DialectalArabicMMLU provides the first unified, human-curated resource for
measuring dialectal understanding in Arabic, thus promoting more inclusive
evaluation and future model development.
comment: 9 pages, 9 tables
☆ Patient-Centered Summarization Framework for AI Clinical Summarization: A Mixed-Methods Design
Maria Lizarazo Jimenez, Ana Gabriela Claros, Kieran Green, David Toro-Tobon, Felipe Larios, Sheena Asthana, Camila Wenczenovicz, Kerly Guevara Maldonado, Luis Vilatuna-Andrango, Cristina Proano-Velez, Satya Sai Sri Bandi, Shubhangi Bagewadi, Megan E. Branda, Misk Al Zahidy, Saturnino Luz, Mirella Lapata, Juan P. Brito, Oscar J. Ponce-Ponte
Large Language Models (LLMs) are increasingly demonstrating the potential to
reach human-level performance in generating clinical summaries from
patient-clinician conversations. However, these summaries often focus on
patients' biology rather than their preferences, values, wishes, and concerns.
To achieve patient-centered care, we propose a new standard for Artificial
Intelligence (AI) clinical summarization tasks: Patient-Centered Summaries
(PCS). Our objective was to develop a framework to generate PCS that capture
patient values and ensure clinical utility and to assess whether current
open-source LLMs can achieve human-level performance in this task. We used a
mixed-methods process. Two Patient and Public Involvement groups (10 patients
and 8 clinicians) in the United Kingdom participated in semi-structured
interviews exploring what personal and contextual information should be
included in clinical summaries and how it should be structured for clinical
use. Findings informed annotation guidelines used by eight clinicians to create
gold-standard PCS from 88 atrial fibrillation consultations. Sixteen
consultations were used to refine a prompt aligned with the guidelines. Five
open-source LLMs (Llama-3.2-3B, Llama-3.1-8B, Mistral-8B, Gemma-3-4B, and
Qwen3-8B) generated summaries for 72 consultations using zero-shot and few-shot
prompting, evaluated with ROUGE-L, BERTScore, and qualitative metrics. Patients
emphasized lifestyle routines, social support, recent stressors, and care
values. Clinicians sought concise functional, psychosocial, and emotional
context. The best zero-shot performance was achieved by Mistral-8B (ROUGE-L
0.189) and Llama-3.1-8B (BERTScore 0.673); the best few-shot by Llama-3.1-8B
(ROUGE-L 0.206, BERTScore 0.683). Completeness and fluency were similar between
experts and models, while correctness and patient-centeredness favored human
PCS.
comment: The first two listed authors contributed equally. Pages: 21; Figures: 2; Tables: 3
☆ SQLSpace: A Representation Space for Text-to-SQL to Discover and Mitigate Robustness Gaps EMNLP
We introduce SQLSpace, a human-interpretable, generalizable, compact
representation for text-to-SQL examples derived with minimal human
intervention. We demonstrate the utility of these representations in evaluation
with three use cases: (i) closely comparing and contrasting the composition of
popular text-to-SQL benchmarks to identify unique dimensions of examples they
evaluate, (ii) understanding model performance at a granular level beyond
overall accuracy scores, and (iii) improving model performance through targeted
query rewriting based on learned correctness estimation. We show that SQLSpace
enables analysis that would be difficult with raw examples alone: it reveals
compositional differences between benchmarks, exposes performance patterns
obscured by accuracy alone, and supports modeling of query success.
comment: Accepted to EMNLP Findings
☆ BiSparse-AAS: Bilinear Sparse Attention and Adaptive Spans Framework for Scalable and Efficient Text Summarization ICDM
Transformer-based architectures have advanced text summarization, yet their
quadratic complexity limits scalability on long documents. This paper
introduces BiSparse-AAS (Bilinear Sparse Attention with Adaptive Spans), a
novel framework that combines sparse attention, adaptive spans, and bilinear
attention to address these limitations. Sparse attention reduces computational
costs by focusing on the most relevant parts of the input, while adaptive spans
dynamically adjust the attention ranges. Bilinear attention complements both by
modeling complex token interactions within this refined context. BiSparse-AAS
consistently outperforms state-of-the-art baselines in both extractive and
abstractive summarization tasks, achieving average ROUGE improvements of about
68.1% on CNN/DailyMail and 52.6% on XSum, while maintaining strong performance
on OpenWebText and Gigaword datasets. By addressing efficiency, scalability,
and long-sequence modeling, BiSparse-AAS provides a unified, practical solution
for real-world text summarization applications.
comment: Accepted at the IEEE International Conference on Data Mining (ICDM)
2025, Washington, DC, USA
☆ Effect of Domain Generalization Techniques in Low Resource Systems
Machine learning models typically assume that training and test data follow
the same distribution, an assumption that often fails in real-world scenarios
due to distribution shifts. This issue is especially pronounced in low-resource
settings, where data scarcity and limited domain diversity hinder robust
generalization. Domain generalization (DG) approaches address this challenge by
learning features that remain invariant across domains, often using causal
mechanisms to improve model robustness. In this study, we examine two distinct
causal DG techniques in low-resource natural language tasks. First, we
investigate a causal data augmentation (CDA) approach that automatically
generates counterfactual examples to improve robustness to spurious
correlations. We apply this method to sentiment classification on the
NaijaSenti Twitter corpus, expanding the training data with semantically
equivalent paraphrases to simulate controlled distribution shifts. Second, we
explore an invariant causal representation learning (ICRL) approach using the
DINER framework, originally proposed for debiasing aspect-based sentiment
analysis. We adapt DINER to a multilingual setting. Our findings demonstrate
that both approaches enhance robustness to unseen domains: counterfactual data
augmentation yields consistent cross-domain accuracy gains in sentiment
classification, while causal representation learning with DINER improves
out-of-distribution performance in multilingual sentiment analysis, albeit with
varying gains across languages.
☆ Thought Branches: Interpreting LLM Reasoning Requires Resampling
Most work interpreting reasoning models studies only a single
chain-of-thought (CoT), yet these models define distributions over many
possible CoTs. We argue that studying a single sample is inadequate for
understanding causal influence and the underlying computation. Though fully
specifying this distribution is intractable, it can be understood by sampling.
We present case studies using resampling to investigate model decisions. First,
when a model states a reason for its action, does that reason actually cause
the action? In "agentic misalignment" scenarios, we resample specific sentences
to measure their downstream effects. Self-preservation sentences have small
causal impact, suggesting they do not meaningfully drive blackmail. Second, are
artificial edits to CoT sufficient for steering reasoning? These are common in the literature, yet they take the model off-policy. Resampling and selecting a
completion with the desired property is a principled on-policy alternative. We
find off-policy interventions yield small and unstable effects compared to
resampling in decision-making tasks. Third, how do we understand the effect of
removing a reasoning step when the model may repeat it post-edit? We introduce
a resilience metric that repeatedly resamples to prevent similar content from
reappearing downstream. Critical planning statements resist removal but have
large effects when eliminated. Fourth, since CoT is sometimes "unfaithful", can
our methods teach us anything in these settings? Adapting causal mediation
analysis, we find that hints that have a causal effect on the output without
being explicitly mentioned exert a subtle and cumulative influence on the CoT
that persists even if the hint is removed. Overall, studying distributions via
resampling enables reliable causal analysis, clearer narratives of model
reasoning, and principled CoT interventions.
comment: Uzay Macar and Paul C. Bogdan contributed equally to this work, and
their listed order was determined by coinflip
☆ The aftermath of compounds: Investigating Compounds and their Semantic Representations AACL
This study investigates how well computational embeddings align with human
semantic judgments in the processing of English compound words. We compare
static word vectors (GloVe) and contextualized embeddings (BERT) against human
ratings of lexeme meaning dominance (LMD) and semantic transparency (ST) drawn
from a psycholinguistic dataset. Using measures of association strength
(Edinburgh Associative Thesaurus), frequency (BNC), and predictability (LaDEC),
we compute embedding-derived LMD and ST metrics and assess their relationships
with human judgments via Spearman's correlation and regression analyses. Our
results show that BERT embeddings better capture compositional semantics than
GloVe, and that predictability ratings are strong predictors of semantic
transparency in both human and model data. These findings advance computational
psycholinguistics by clarifying the factors that drive compound word processing
and offering insights into embedding-based semantic modeling.
comment: IJCNLP-AACL SRW 2025
☆ Diffuse Thinking: Exploring Diffusion Language Models as Efficient Thought Proposers for Reasoning
In recent years, large language models (LLMs) have witnessed remarkable
advancements, with the test-time scaling law consistently enhancing the
reasoning capabilities. Through systematic evaluation and exploration of a
diverse spectrum of intermediate thoughts, LLMs demonstrate the potential to
generate deliberate reasoning steps, thereby substantially enhancing reasoning
accuracy. However, LLMs' autoregressive generation paradigm results in
reasoning performance scaling sub-optimally with test-time computation, often
requiring excessive computational overhead to propose thoughts while yielding
only marginal performance gains. In contrast, diffusion language models (DLMs)
can efficiently produce diverse samples through parallel denoising in a single
forward pass, inspiring us to leverage them for proposing intermediate
thoughts, thereby alleviating the computational burden associated with
autoregressive generation while maintaining quality. In this work, we propose
an efficient collaborative reasoning framework, leveraging DLMs to generate
candidate thoughts and LLMs to evaluate their quality. Experiments across
diverse benchmarks demonstrate that our framework achieves strong performance
in complex reasoning tasks, offering a promising direction for future research.
Our code is open-source at
https://anonymous.4open.science/r/Diffuse-Thinking-EC60.
☆ VCORE: Variance-Controlled Optimization-based Reweighting for Chain-of-Thought Supervision
Supervised fine-tuning (SFT) on long chain-of-thought (CoT) trajectories has
emerged as a crucial technique for enhancing the reasoning abilities of large
language models (LLMs). However, the standard cross-entropy loss treats all
tokens equally, ignoring their heterogeneous contributions across a reasoning
trajectory. This uniform treatment leads to misallocated supervision and weak
generalization, especially in complex, long-form reasoning tasks. To address
this, we introduce Variance-Controlled Optimization-based REweighting (VCORE), a principled
framework that reformulates CoT supervision as a constrained optimization
problem. By adopting an optimization-theoretic perspective, VCORE enables a
principled and adaptive allocation of supervision across tokens, thereby
aligning the training objective more closely with the goal of robust reasoning
generalization. Empirical evaluations demonstrate that VCORE consistently
outperforms existing token reweighting methods. Across both in-domain and
out-of-domain settings, VCORE achieves substantial performance gains on
mathematical and coding benchmarks, using models from the Qwen3 series (4B, 8B,
32B) and LLaMA-3.1-8B-Instruct. Moreover, we show that VCORE serves as a more
effective initialization for subsequent reinforcement learning, establishing a
stronger foundation for advancing the reasoning capabilities of LLMs. The Code
will be released at https://github.com/coder-gx/VCORE.
comment: Under Review
☆ DeepCompress: A Dual Reward Strategy for Dynamically Exploring and Compressing Reasoning Chains
Large Reasoning Models (LRMs) have demonstrated impressive capabilities but
suffer from cognitive inefficiencies like "overthinking" simple problems and "underthinking" complex ones. While existing methods that use supervised fine-tuning (SFT) or reinforcement learning (RL) with token-length rewards can improve efficiency, they often do so at the cost of accuracy. This paper introduces DeepCompress, a novel framework that simultaneously
enhances both the accuracy and efficiency of LRMs. We challenge the prevailing
approach of consistently favoring shorter reasoning paths, showing that longer
responses can contain a broader range of correct solutions for difficult
problems. DeepCompress employs an adaptive length reward mechanism that
dynamically classifies problems as "Simple" or "Hard" in real time based on the model's evolving capability. It encourages shorter, more efficient reasoning for "Simple" problems while promoting longer, more exploratory thought chains for "Hard" problems. This dual-reward strategy enables the
model to autonomously adjust its Chain-of-Thought (CoT) length, compressing
reasoning for well-mastered problems and extending it for those it finds
challenging. Experimental results on challenging mathematical benchmarks show
that DeepCompress consistently outperforms baseline methods, achieving superior
accuracy while significantly improving token efficiency.
comment: Work in progress
☆ Dynamic Affective Memory Management for Personalized LLM Agents
Advances in large language models are making personalized AI agents a new
research focus. While current agent systems primarily rely on personalized
external memory databases to deliver customized experiences, they face
challenges such as memory redundancy, memory staleness, and poor memory-context
integration, largely due to the lack of effective memory updates during
interaction. To tackle these issues, we propose a new memory management system
designed for affective scenarios. Our approach employs a Bayesian-inspired
memory update algorithm with the concept of memory entropy, enabling the agent
to autonomously maintain a dynamically updated memory vector database by
minimizing global entropy to provide more personalized services. To better
evaluate the system's effectiveness in this context, we propose DABench, a
benchmark focusing on emotional expression and emotional change toward objects.
Experimental results demonstrate that our system achieves superior performance
in personalization, logical coherence, and accuracy. Ablation studies further
validate the effectiveness of the Bayesian-inspired update mechanism in
alleviating memory bloat. Our work offers new insights into the design of
long-term memory systems.
comment: 12 pages, 8 figures
☆ Atlas-Alignment: Making Interpretability Transferable Across Language Models
Interpretability is crucial for building safe, reliable, and controllable
language models, yet existing interpretability pipelines remain costly and difficult to scale. Interpreting a new model typically requires training model-specific sparse autoencoders (SAEs), manually or semi-automatically labeling SAE components, and then validating them. We introduce Atlas-Alignment, a
framework for transferring interpretability across language models by aligning
unknown latent spaces to a Concept Atlas - a labeled, human-interpretable
latent space - using only shared inputs and lightweight representational
alignment techniques. Once aligned, this enables two key capabilities in
previously opaque models: (1) semantic feature search and retrieval, and (2)
steering generation along human-interpretable atlas concepts. Through
quantitative and qualitative evaluations, we show that simple representational
alignment methods enable robust semantic retrieval and steerable generation
without the need for labeled concept data. Atlas-Alignment thus amortizes the
cost of explainable AI and mechanistic interpretability: by investing in one
high-quality Concept Atlas, we can make many new models transparent and
controllable at minimal marginal cost.
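As one possible instantiation of "lightweight representational alignment", the sketch below fits a linear map from a new model's activations to the Concept Atlas space by least squares over shared inputs; the actual alignment technique, data, and dimensions in the paper may differ.

```python
# Assumed-for-illustration alignment: a least-squares linear map into a labeled
# Concept Atlas space, fit on activations collected from the same shared inputs.
import numpy as np

def fit_alignment(H_new, H_atlas):
    """H_new: (n_inputs, d_new) activations of the unknown model.
    H_atlas: (n_inputs, d_atlas) activations in the Concept Atlas space."""
    W, *_ = np.linalg.lstsq(H_new, H_atlas, rcond=None)
    return W                                   # (d_new, d_atlas)

rng = np.random.default_rng(0)
H_atlas = rng.normal(size=(500, 32))
H_new = H_atlas @ rng.normal(size=(32, 64)) + 0.01 * rng.normal(size=(500, 64))
W = fit_alignment(H_new, H_atlas)
# project a new activation and find the nearest labeled atlas concept (toy labels)
concepts = {"negation": rng.normal(size=32), "past_tense": rng.normal(size=32)}
h = H_new[0] @ W
print(max(concepts, key=lambda c: concepts[c] @ h))
```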
☆ Awal -- Community-Powered Language Technology for Tamazight
This paper presents Awal, a community-powered initiative for developing
language technology resources for Tamazight. We provide a comprehensive review
of the NLP landscape for Tamazight, examining recent progress in computational
resources, and the emergence of community-driven approaches to address
persistent data scarcity. Launched in 2024, the awaldigital.org platform addresses
the underrepresentation of Tamazight in digital spaces through a collaborative
platform enabling speakers to contribute translation and voice data. We analyze
18 months of community engagement, revealing significant barriers to
participation including limited confidence in written Tamazight and ongoing
standardization challenges. Despite widespread positive reception, actual data
contribution remained concentrated among linguists and activists. The modest
scale of community contributions -- 6,421 translation pairs and 3 hours of
speech data -- highlights the limitations of applying standard crowdsourcing
approaches to languages with complex sociolinguistic contexts. We are working
on improved open-source MT models using the collected data.
comment: Accepted to the International Conference on Information and
Communication Technologies for Amazigh (TICAM 25)
☆ Balancing Knowledge Updates: Toward Unified Modular Editing in LLMs
Knowledge editing has emerged as an efficient approach for updating factual
knowledge in large language models (LLMs). It typically locates knowledge
storage modules and then modifies their parameters. However, most existing
methods focus on the weights of multilayer perceptron (MLP) modules, which are
often identified as the main repositories of factual information. Other
components, such as attention (Attn) modules, are often ignored during editing.
This imbalance can leave residual outdated knowledge and limit editing
effectiveness. We perform comprehensive knowledge localization experiments on
advanced LLMs and find that Attn modules play a substantial role in factual
knowledge storage and retrieval, especially in earlier layers. Based on these
insights, we propose IntAttn-Edit, a method that extends the associative memory
paradigm to jointly update both MLP and Attn modules. Our approach uses a
knowledge balancing strategy that allocates update magnitudes in proportion to
each module's measured contribution to knowledge storage. Experiments on
standard benchmarks show that IntAttn-Edit achieves higher edit success, better
generalization, and stronger knowledge preservation than prior methods. Further
analysis shows that the balancing strategy keeps editing performance within an
optimal range across diverse settings.
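The knowledge-balancing idea can be illustrated with a toy allocation rule (module names and contribution scores below are made up, not measurements from the paper): the total edit budget is split across MLP and attention modules in proportion to their measured contribution to storing the fact.

```python
# Toy proportional-allocation sketch for splitting an edit budget across modules.
def allocate_update_budget(contributions, total_budget=1.0):
    total = sum(contributions.values())
    return {module: total_budget * c / total for module, c in contributions.items()}

print(allocate_update_budget({"mlp.layer5": 0.6, "attn.layer3": 0.3, "attn.layer4": 0.1}))
```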
☆ Measuring Chain-of-Thought Monitorability Through Faithfulness and Verbosity
Chain-of-thought (CoT) outputs let us read a model's step-by-step reasoning.
Since any long, serial reasoning process must pass through this textual trace,
the quality of the CoT is a direct window into what the model is thinking. This
visibility could help us spot unsafe or misaligned behavior (monitorability),
but only if the CoT is transparent about its internal reasoning (faithfulness).
Fully measuring faithfulness is difficult, so researchers often focus on
examining the CoT in cases where the model changes its answer after adding a
cue to the input. This proxy finds some instances of unfaithfulness but loses
information when the model maintains its answer, and does not investigate
aspects of reasoning not tied to the cue. We extend these results to a more
holistic sense of monitorability by introducing verbosity: whether the CoT
lists every factor needed to solve the task. We combine faithfulness and
verbosity into a single monitorability score that shows how well the CoT serves
as the model's external 'working memory', a property that many safety schemes
based on CoT monitoring depend on. We evaluate instruction-tuned and reasoning
models on BBH, GPQA, and MMLU. Our results show that models can appear faithful
yet remain hard to monitor when they leave out key factors, and that
monitorability differs sharply across model families. We release our evaluation
code using the Inspect library to support reproducible future work.
comment: Project page at https://ajmeek.github.io/cot_monitorability_website/
☆ From the Rock Floor to the Cloud: A Systematic Survey of State-of-the-Art NLP in Battery Life Cycle
Tosin Adewumi, Martin Karlsson, Marcus Liwicki, Mikael Sjödahl, Lama Alkhaled, Rihab Gargouri, Nudrat Habib, Franz Hennie
We present a comprehensive systematic survey of the application of natural
language processing (NLP) along the entire battery life cycle, instead of one
stage or method, and introduce a novel technical language processing (TLP)
framework for the EU's proposed digital battery passport (DBP) and other
general battery predictions. We follow the Preferred Reporting Items for
Systematic Reviews and Meta-Analyses (PRISMA) method and employ three reputable
databases and search engines: Google Scholar, Institute of Electrical and Electronics Engineers Xplore (IEEE Xplore), and Scopus. In total, we assessed 274 scientific papers before critically reviewing the final 66 relevant papers. We publicly provide artifacts of the review for validation and
reproducibility. The findings show that new NLP tasks are emerging in the
battery domain, which facilitate materials discovery and other stages of the
life cycle. Notwithstanding, challenges remain, such as the lack of standard
benchmarks. Our proposed TLP framework, which incorporates agentic AI and
optimized prompts, will be apt for tackling some of the challenges.
comment: 15 pages, 3 images
☆ ThoughtProbe: Classifier-Guided LLM Thought Space Exploration via Probing Representations EMNLP2025
This paper introduces ThoughtProbe, a novel inference time framework that
leverages the hidden reasoning features of Large Language Models (LLMs) to
improve their reasoning performance. Unlike previous works that manipulate the
hidden representations to steer LLM generation, we harness them as
discriminative signals to guide the tree structured response space exploration.
In each node expansion, a classifier serves as a scoring and ranking mechanism
that efficiently allocates computational resources by prioritizing higher score
candidates for continuation. After completing the tree expansion, we collect
answers from all branches to form a candidate answer pool. We then propose a
branch aggregation method that marginalizes over all supporting branches by
aggregating their CoT scores, thereby identifying the optimal answer from the
pool. Experimental results show that our framework's comprehensive exploration
not only covers valid reasoning chains but also effectively identifies them,
achieving significant improvements across multiple arithmetic reasoning
benchmarks.
comment: EMNLP2025 main conference
☆ TransAlign: Machine Translation Encoders are Strong Word Aligners, Too
In the absence of sizable training data for most world languages and NLP
tasks, translation-based strategies such as translate-test -- evaluating on
noisy source language data translated from the target language -- and
translate-train -- training on noisy target language data translated from the
source language -- have been established as competitive approaches for
cross-lingual transfer (XLT). For token classification tasks, these strategies
require label projection: mapping the labels from each token in the original
sentence to its counterpart(s) in the translation. To this end, it is common to
leverage multilingual word aligners (WAs) derived from encoder language models
such as mBERT or LaBSE. Despite obvious associations between machine
translation (MT) and WA, research on extracting alignments with MT models is
largely limited to exploiting cross-attention in encoder-decoder architectures,
yielding poor WA results. In this work, in contrast, we propose TransAlign, a
novel word aligner that utilizes the encoder of a massively multilingual MT
model. We show that TransAlign not only achieves strong WA performance but
substantially outperforms popular WA and state-of-the-art non-WA-based label
projection methods in MT-based XLT for token classification.
☆ A Unified Representation Underlying the Judgment of Large Language Models
A central architectural question for both biological and artificial
intelligence is whether judgment relies on specialized modules or a unified,
domain-general resource. While the discovery of decodable neural
representations for distinct concepts in Large Language Models (LLMs) has
suggested a modular architecture, whether these representations are truly
independent systems remains an open question. Here we provide evidence for a
convergent architecture. Across a range of LLMs, we find that diverse
evaluative judgments are computed along a dominant dimension, which we term the
Valence-Assent Axis (VAA). This axis jointly encodes subjective valence ("what
is good") and the model's assent to factual claims ("what is true"). Through
direct interventions, we show this unified representation creates a critical
dependency: the VAA functions as a control signal that steers the generative
process to construct a rationale consistent with its evaluative state, even at
the cost of factual accuracy. This mechanism, which we term the subordination
of reasoning, shifts the process of reasoning from impartial inference toward
goal-directed justification. Our discovery offers a mechanistic account for
systemic bias and hallucination, revealing how an architecture that promotes
coherent judgment can systematically undermine faithful reasoning.
☆ Un-Attributability: Computing Novelty From Retrieval & Semantic Similarity
Understanding how language-model outputs relate to the pretraining corpus is
central to studying model behavior. Most training data attribution (TDA)
methods ask which training examples causally influence a given output, often
using leave-one-out tests. We invert the question: which outputs cannot be
attributed to any pretraining example? We introduce un-attributability as an
operational measure of semantic novelty: an output is novel if the pretraining
corpus contains no semantically similar context. We approximate this with a
simple two-stage retrieval pipeline: index the corpus with lightweight GIST
embeddings, retrieve the top-n candidates, then rerank with ColBERTv2. If the
nearest corpus item is less attributable than a human-generated text reference,
we consider the output of the model as novel. We evaluate on SmolLM and SmolLM2
and report three findings: (1) models draw on pretraining data across much
longer spans than previously reported; (2) some domains systematically promote
or suppress novelty; and (3) instruction tuning not only alters style but also
increases novelty. Reframing novelty assessment around un-attributability
enables efficient analysis at pretraining scale. We release ~20 TB of corpus
chunks and index artifacts to support replication and large-scale extension of
our analysis at https://huggingface.co/datasets/stai-tuebingen/faiss-smollm
☆ Why Do Multilingual Reasoning Gaps Emerge in Reasoning Language Models?
Reasoning language models (RLMs) achieve strong performance on complex
reasoning tasks, yet they still suffer from a multilingual reasoning gap,
performing better in high-resource languages than in low-resource ones. While
recent efforts have reduced this gap, its underlying causes remain largely
unexplored. In this paper, we address this by showing that the multilingual
reasoning gap largely stems from failures in language understanding: the model's inability to represent the meaning of the multilingual input in the dominant language (i.e., English) within its reasoning trace. This motivates us to
examine whether understanding failures can be detected, as this ability could
help mitigate the multilingual reasoning gap. To this end, we evaluate a range
of detection methods and find that understanding failures can indeed be
identified, with supervised approaches performing best. Building on this, we
propose Selective Translation, a simple yet effective strategy that translates
the multilingual input into English only when an understanding failure is
detected. Experimental results show that Selective Translation bridges the
multilingual reasoning gap, achieving near full-translation performance while
using translation for only about 20% of inputs. Together, our work demonstrates
that understanding failures are the primary cause of the multilingual reasoning
gap and can be detected and selectively mitigated, providing key insight into
its origin and a promising path toward more equitable multilingual reasoning.
Our code and data are publicly available at
https://github.com/deokhk/RLM_analysis.
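Selective Translation itself reduces to a small control-flow wrapper; in the sketch below the failure detector, translator, and reasoner are stubbed placeholders rather than the authors' components.

```python
# Minimal sketch of Selective Translation: translate to English only when an
# understanding failure is predicted. All callables are illustrative stubs.
def selective_translation(question, detect_failure, translate_to_en, reason):
    if detect_failure(question):          # e.g. a supervised classifier's prediction
        question = translate_to_en(question)
    return reason(question)

# toy usage with stubbed components
answer = selective_translation(
    "ఒక త్రిభుజం కోణాల మొత్తం ఎంత?",            # low-resource-language input
    detect_failure=lambda q: True,
    translate_to_en=lambda q: "What is the sum of the angles of a triangle?",
    reason=lambda q: "180 degrees",
)
print(answer)
```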
☆ MedCalc-Eval and MedCalc-Env: Advancing Medical Calculation Capabilities of Large Language Models
Kangkun Mao, Jinru Ding, Jiayuan Chen, Mouxiao Bian, Ruiyao Chen, Xinwei Peng, Sijie Ren, Linyang Li, Jie Xu
As large language models (LLMs) enter the medical domain, most benchmarks
evaluate them on question answering or descriptive reasoning, overlooking
quantitative reasoning critical to clinical decision-making. Existing datasets
like MedCalc-Bench cover few calculation tasks and fail to reflect real-world
computational scenarios.
We introduce MedCalc-Eval, the largest benchmark for assessing LLMs' medical
calculation abilities, comprising 700+ tasks across two types: equation-based
(e.g., Cockcroft-Gault, BMI, BSA) and rule-based scoring systems (e.g., Apgar,
Glasgow Coma Scale). These tasks span diverse specialties including internal
medicine, surgery, pediatrics, and cardiology, offering a broader and more
challenging evaluation setting.
To improve performance, we further develop MedCalc-Env, a reinforcement
learning environment built on the InternBootcamp framework, enabling multi-step
clinical reasoning and planning. Fine-tuning a Qwen2.5-32B model within this
environment achieves state-of-the-art results on MedCalc-Eval, with notable
gains in numerical sensitivity, formula selection, and reasoning robustness.
Remaining challenges include unit conversion, multi-condition logic, and
contextual understanding.
Code and datasets are available at
https://github.com/maokangkun/MedCalc-Eval.
☆ Higher-order Linear Attention
The quadratic cost of scaled dot-product attention is a central obstacle to
scaling autoregressive language models to long contexts. Linear-time attention
and State Space Models (SSMs) provide scalable alternatives but are typically
restricted to first-order or kernel-based approximations, which can limit
expressivity. We introduce Higher-order Linear Attention (HLA), a causal,
streaming mechanism that realizes higher interactions via compact prefix
sufficient statistics. In the second-order case, HLA maintains a constant-size
state and computes per-token outputs in linear time without materializing any
$n \times n$ matrices. We give closed-form streaming identities, a strictly
causal masked variant using two additional summaries, and a chunk-parallel
training scheme based on associative scans that reproduces the activations of a
serial recurrence exactly. We further outline extensions to third and higher
orders. Collectively, these results position HLA as a principled, scalable
building block that combines attention-like, data-dependent mixing with the
efficiency of modern recurrent architectures. Project Page:
https://github.com/yifanzhang-pro/HLA.
comment: Project Page: https://github.com/yifanzhang-pro/HLA
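For background on the "prefix sufficient statistics" view, the sketch below shows ordinary first-order causal linear attention in streaming form, maintaining a constant-size state S = sum_j phi(k_j) v_j^T; HLA, per the abstract, generalizes this recurrence to higher-order statistics (the generalization itself is not reproduced here).

```python
# First-order causal linear attention in streaming form (background, not HLA).
import torch

def linear_attention_streaming(Q, K, V):
    """Q, K, V: (seq_len, d). Returns causal outputs in O(seq_len * d^2)."""
    d = Q.shape[-1]
    S = torch.zeros(d, d)            # running sum of k_j v_j^T (constant size)
    z = torch.zeros(d)               # running sum of k_j, for normalization
    phi = lambda x: torch.nn.functional.elu(x) + 1   # positive feature map
    outputs = []
    for q, k, v in zip(phi(Q), phi(K), V):
        S = S + torch.outer(k, v)
        z = z + k
        outputs.append((q @ S) / (q @ z + 1e-6))
    return torch.stack(outputs)

Q = torch.randn(16, 8); K = torch.randn(16, 8); V = torch.randn(16, 8)
print(linear_attention_streaming(Q, K, V).shape)   # torch.Size([16, 8])
```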
☆ Languages are Modalities: Cross-Lingual Alignment via Encoder Injection
Instruction-tuned Large Language Models (LLMs) underperform on low-resource,
non-Latin scripts due to tokenizer fragmentation and weak cross-lingual
coupling. We present LLINK (Latent Language Injection for Non-English
Knowledge), a compute efficient language-as-modality method that conditions an
instruction-tuned decoder without changing the tokenizer or retraining the
decoder. First, we align sentence embeddings from a frozen multilingual encoder
to the decoder's latent embedding space at a reserved position via a
lightweight contrastive projector. Second, the vector is expanded into K soft
slots and trained with minimal adapters so the frozen decoder consumes the
signal. LLINK substantially improves bilingual retrieval and achieves 81.3%
preference over the base model and 63.6% over direct fine-tuning in LLM-judged
Q&A evaluations. We further find that improvements can be attributed to reduced
tokenization inflation and stronger cross-lingual alignment, despite residual weaknesses in numeric fidelity. Treating low-resource
languages as a modality offers a practical path to stronger cross-lingual
alignment in lightweight LLMs.
comment: 14 pages, 3 Figures
☆ Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs
Evaluating the abilities of large language models (LLMs) for tasks that
require long-term memory and thus long-context reasoning, for example in
conversational settings, is hampered by the existing benchmarks, which often
lack narrative coherence, cover narrow domains, and only test simple
recall-oriented tasks. This paper introduces a comprehensive solution to these
challenges. First, we present a novel framework for automatically generating
long (up to 10M tokens), coherent, and topically diverse conversations,
accompanied by probing questions targeting a wide range of memory abilities.
From this, we construct BEAM, a new benchmark comprising 100 conversations and
2,000 validated questions. Second, to enhance model performance, we propose
LIGHT, a framework inspired by human cognition that equips LLMs with three
complementary memory systems: a long-term episodic memory, a short-term working
memory, and a scratchpad for accumulating salient facts. Our experiments on
BEAM reveal that even LLMs with 1M token context windows (with and without
retrieval-augmentation) struggle as dialogues lengthen. In contrast, LIGHT
consistently improves performance across various models, achieving an average
improvement of 3.5%-12.69% over the strongest baselines, depending on the
backbone LLM. An ablation study further confirms the contribution of each
memory component.
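The three memory systems named in the abstract can be pictured with a toy container like the one below; the structure and policies are assumptions for illustration, not the LIGHT implementation.

```python
# Toy container with long-term episodic memory, short-term working memory, and
# a scratchpad of salient facts. Policies and retrieval are placeholders.
from collections import deque

class ThreeSystemMemory:
    def __init__(self, working_size=10):
        self.episodic = []                          # long-term store of past turns
        self.working = deque(maxlen=working_size)   # short-term rolling window
        self.scratchpad = []                        # accumulated salient facts

    def observe(self, turn, salient_fact=None):
        self.episodic.append(turn)
        self.working.append(turn)
        if salient_fact:
            self.scratchpad.append(salient_fact)

    def context_for(self, query, retrieve):
        # combine retrieved episodic memories, recent turns, and noted facts
        return retrieve(query, self.episodic) + list(self.working) + self.scratchpad

mem = ThreeSystemMemory()
mem.observe("User: my sister Ana moved to Porto.", salient_fact="sister Ana lives in Porto")
print(mem.context_for("Where does Ana live?", retrieve=lambda q, store: store[-3:]))
```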
☆ Identifying the Periodicity of Information in Natural Language
Recent theoretical advances in the study of information density in natural language raise the following question: To what degree does natural language exhibit periodic patterns in its encoded information? We address this
question by introducing a new method called AutoPeriod of Surprisal (APS). APS
adopts a canonical periodicity detection algorithm and is able to identify any
significant periods that exist in the surprisal sequence of a single document.
By applying the algorithm to a set of corpora, we have obtained the following
interesting results: Firstly, a considerable proportion of human language
demonstrates a strong pattern of periodicity in information; Secondly, new
periods that are outside the distributions of typical structural units in text
(e.g., sentence boundaries, elementary discourse units, etc.) are found and
further confirmed via harmonic regression modeling. We conclude that the
periodicity of information in language is a joint outcome from both structured
factors and other driving factors that take effect at longer distances. The
advantages of our periodicity detection method and its potential in
LLM-generation detection are further discussed.
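A minimal stand-in for the periodicity test (not the authors' APS implementation): compute a periodogram of the per-token surprisal sequence and keep peaks that exceed a permutation-based significance threshold.

```python
# Toy periodicity detector over a surprisal sequence; thresholding scheme is an
# illustrative assumption, not the paper's exact procedure.
import numpy as np

def significant_periods(surprisal, n_perm=200, alpha=0.01, seed=0):
    rng = np.random.default_rng(seed)
    x = surprisal - surprisal.mean()
    power = np.abs(np.fft.rfft(x)) ** 2
    # null distribution of the maximum spectral power from shuffled sequences
    null_max = np.array([
        np.abs(np.fft.rfft(rng.permutation(x))).max() ** 2 for _ in range(n_perm)])
    threshold = np.quantile(null_max, 1 - alpha)
    freqs = np.fft.rfftfreq(len(x))
    keep = (power > threshold) & (freqs > 0)
    return 1.0 / freqs[keep]              # periods measured in tokens

# toy surprisal with an injected period of ~20 tokens plus noise
t = np.arange(1000)
surprisal = 5 + np.sin(2 * np.pi * t / 20) + np.random.default_rng(1).normal(0, 0.5, 1000)
print(np.round(significant_periods(surprisal), 1))
```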
☆ DRAMA: Unifying Data Retrieval and Analysis for Open-Domain Analytic Queries SIGMOD 2026
Manually conducting real-world data analyses is labor-intensive and
inefficient. Despite numerous attempts to automate data science workflows, none
of the existing paradigms or systems fully demonstrate all three key
capabilities required to support them effectively: (1) open-domain data
collection, (2) structured data transformation, and (3) analytic reasoning.
To overcome these limitations, we propose DRAMA, an end-to-end paradigm that
answers users' analytic queries in natural language on large-scale open-domain
data. DRAMA unifies data collection, transformation, and analysis as a single
pipeline. To quantitatively evaluate system performance on tasks representative
of DRAMA, we construct a benchmark, DRAMA-Bench, consisting of two categories
of tasks: claim verification and question answering, each comprising 100
instances. These tasks are derived from real-world applications that have
gained significant public attention and require the retrieval and analysis of
open-domain data. We develop DRAMA-Bot, a multi-agent system designed following
DRAMA. It comprises a data retriever that collects and transforms data by
coordinating the execution of sub-agents, and a data analyzer that performs
structured reasoning over the retrieved data. We evaluate DRAMA-Bot on
DRAMA-Bench together with five state-of-the-art baseline agents. DRAMA-Bot
achieves 86.5% task accuracy at a cost of $0.05, outperforming all baselines
with up to 6.9 times the accuracy and less than 1/6 of the cost. DRAMA is
publicly available at https://github.com/uiuc-kang-lab/drama.
comment: Accepted to SIGMOD 2026
☆ MemeArena: Automating Context-Aware Unbiased Evaluation of Harmfulness Understanding for Multimodal Large Language Models EMNLP 2025
The proliferation of memes on social media necessitates the capabilities of
multimodal Large Language Models (mLLMs) to effectively understand multimodal
harmfulness. Existing evaluation approaches predominantly focus on mLLMs'
detection accuracy for binary classification tasks, which often fail to reflect
the in-depth interpretive nuance of harmfulness across diverse contexts. In
this paper, we propose MemeArena, an agent-based arena-style evaluation
framework that provides a context-aware and unbiased assessment for mLLMs'
understanding of multimodal harmfulness. Specifically, MemeArena simulates
diverse interpretive contexts to formulate evaluation tasks that elicit
perspective-specific analyses from mLLMs. By integrating varied viewpoints and
reaching consensus among evaluators, it enables fair and unbiased comparisons
of mLLMs' abilities to interpret multimodal harmfulness. Extensive experiments
demonstrate that our framework effectively reduces the evaluation biases of
judge agents, with judgment results closely aligning with human preferences,
offering valuable insights into reliable and comprehensive mLLM evaluations in
multimodal harmfulness understanding. Our code and data are publicly available
at https://github.com/Lbotirx/MemeArena.
comment: EMNLP 2025
☆ Can MLLMs Read the Room? A Multimodal Benchmark for Verifying Truthfulness in Multi-Party Social Interactions
As AI systems become increasingly integrated into human lives, endowing them
with robust social intelligence has emerged as a critical frontier. A key
aspect of this intelligence is discerning truth from deception, a ubiquitous
element of human interaction that is conveyed through a complex interplay of
verbal language and non-verbal visual cues. However, automatic deception
detection in dynamic, multi-party conversations remains a significant
challenge. The recent rise of powerful Multimodal Large Language Models
(MLLMs), with their impressive abilities in visual and textual understanding,
makes them natural candidates for this task. Yet their capabilities in this crucial domain remain largely unquantified. To address this gap, we
introduce a new task, Multimodal Interactive Veracity Assessment (MIVA), and
present a novel multimodal dataset derived from the social deduction game
Werewolf. This dataset provides synchronized video and text, with verifiable
ground-truth labels for every statement. We establish a comprehensive benchmark
evaluating state-of-the-art MLLMs, revealing a significant performance gap:
even powerful models like GPT-4o struggle to distinguish truth from falsehood
reliably. Our analysis of failure modes indicates that these models fail to
ground language in visual social cues effectively and may be overly
conservative in their alignment, highlighting the urgent need for novel
approaches to building more perceptive and trustworthy AI systems.
☆ Simple Additions, Substantial Gains: Expanding Scripts, Languages, and Lineage Coverage in URIEL+
Mason Shipton, York Hay Ng, Aditya Khan, Phuong Hanh Hoang, Xiang Lu, A. Seza Doğruöz, En-Shiun Annie Lee
The URIEL+ linguistic knowledge base supports multilingual research by
encoding languages through geographic, genetic, and typological vectors.
However, data sparsity remains prevalent, in the form of missing feature types,
incomplete language entries, and limited genealogical coverage. This limits the
usefulness of URIEL+ in cross-lingual transfer, particularly for supporting
low-resource languages. To address this sparsity, this paper extends URIEL+
with three contributions: introducing script vectors to represent writing
system properties for 7,488 languages, integrating Glottolog to add 18,710
additional languages, and expanding lineage imputation for 26,449 languages by
propagating typological and script features across genealogies. These additions
reduce feature sparsity by 14% for script vectors, increase language coverage
by up to 19,015 languages (1,007%), and improve imputation quality metrics by
up to 33%. Our benchmark on cross-lingual transfer tasks (oriented around
low-resource languages) shows occasionally divergent performance compared to
URIEL+, with performance gains up to 6% in certain setups. Our advances make
URIEL+ more complete and inclusive for multilingual research.
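A minimal Python sketch of the lineage-propagation idea, assuming a hypothetical
parent-to-children genealogy mapping; URIEL+'s actual imputation and
conflict-handling logic is more elaborate:

    def impute_by_lineage(features, children, root):
        # features: language -> feature vector (list) or None when missing
        # children: language -> list of direct descendants in the genealogy
        stack = [root]
        while stack:
            parent = stack.pop()
            for child in children.get(parent, []):
                if features.get(child) is None and features.get(parent) is not None:
                    # Inherit the parent's typological/script features when absent.
                    features[child] = list(features[parent])
                stack.append(child)
        return features

    # Example: a child language with no entry inherits its parent's vector.
    feats = {"proto": [1, 0, 1], "childA": None, "childB": [0, 0, 1]}
    tree = {"proto": ["childA", "childB"]}
    impute_by_lineage(feats, tree, "proto")  # feats["childA"] becomes [1, 0, 1]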
☆ Glia: A Human-Inspired AI for Automated Systems Design and Optimization
Pouya Hamadanian, Pantea Karimi, Arash Nasr-Esfahany, Kimia Noorbakhsh, Joseph Chandler, Ali ParandehGheibi, Mohammad Alizadeh, Hari Balakrishnan
Can an AI autonomously design mechanisms for computer systems on par with the
creativity and reasoning of human experts? We present Glia, an AI architecture
for networked systems design that uses large language models (LLMs) in a
human-inspired, multi-agent workflow. Each agent specializes in reasoning,
experimentation, and analysis, collaborating through an evaluation framework
that grounds abstract reasoning in empirical feedback. Unlike prior
ML-for-systems methods that optimize black-box policies, Glia generates
interpretable designs and exposes its reasoning process. When applied to a
distributed GPU cluster for LLM inference, it produces new algorithms for
request routing, scheduling, and auto-scaling that perform at human-expert
levels in significantly less time, while yielding novel insights into workload
behavior. Our results suggest that by combining reasoning LLMs with structured
experimentation, an AI can produce creative and understandable designs for
complex systems problems.
☆ Probability Distributions Computed by Hard-Attention Transformers
Andy Yang, Anej Svete, Jiaoda Li, Anthony Widjaja Lin, Jonathan Rawski, Ryan Cotterell, David Chiang
Most expressivity results for transformers treat them as language recognizers
(which accept or reject strings), and not as they are used in practice, as
language models (which generate strings autoregressively and
probabilistically). Here, we characterize the probability distributions that
transformer language models can express. We show that making transformer
language recognizers autoregressive can sometimes increase their expressivity,
and that making them probabilistic can break equivalences that hold in the
non-probabilistic case. Our overall contribution is to tease apart what
functions transformers are capable of expressing, in their most common use-case
as language models.
comment: 18 pages
☆ Rating Roulette: Self-Inconsistency in LLM-As-A-Judge Frameworks EMNLP 2025
As Natural Language Generation (NLG) continues to be widely adopted, properly
assessing it has become quite difficult. Lately, using large language models
(LLMs) for evaluating these generations has gained traction, as they tend to
align more closely with human preferences than conventional n-gram or
embedding-based metrics. In our experiments, we show that LLM judges have low
intra-rater reliability in their assigned scores across different runs. This
variance makes their ratings inconsistent, almost arbitrary in the worst case,
making it difficult to measure how good their judgments actually are. We
quantify this inconsistency across different NLG tasks and benchmarks and examine
whether judicious use of LLM judges, following proper guidelines, can still be worthwhile.
comment: Accepted at EMNLP 2025
☆ Characterizing Selective Refusal Bias in Large Language Models
Safety guardrails in large language models (LLMs) are developed to prevent
malicious users from generating toxic content at a large scale. However, these
measures can inadvertently introduce or reflect new biases, as LLMs may refuse
to generate harmful content targeting some demographic groups and not others.
We explore this selective refusal bias in LLM guardrails through the lens of
refusal rates of targeted individual and intersectional demographic groups,
types of LLM responses, and length of generated refusals. Our results show
evidence of selective refusal bias across gender, sexual orientation,
nationality, and religion attributes. This leads us to investigate additional
safety implications via an indirect attack, where we target previously refused
groups. Our findings emphasize the need for more equitable and robust
performance in safety guardrails across demographic groups.
comment: 21 pages, 12 figures, 14 tables
☆ Contrastive Knowledge Transfer and Robust Optimization for Secure Alignment of Large Language Models
This paper addresses the limitations of large-scale language models in safety
alignment and robustness by proposing a fine-tuning method that combines
contrastive distillation with noise-robust training. The method freezes the
backbone model and transfers the knowledge boundaries of the teacher model to
the student model through distillation, thereby improving semantic consistency
and alignment accuracy. At the same time, noise perturbations and robust
optimization constraints are introduced during training to ensure that the
model maintains stable predictive outputs under noisy and uncertain inputs. The
overall framework consists of distillation loss, robustness loss, and a
regularization term, forming a unified optimization objective that balances
alignment ability with resistance to interference. To systematically validate
its effectiveness, the study designs experiments from multiple perspectives,
including distillation weight sensitivity, stability analysis under computation
budgets and mixed-precision environments, and the impact of data noise and
distribution shifts on model performance. Results show that the method
significantly outperforms existing baselines in knowledge transfer, robustness,
and overall safety, achieving the best performance across several key metrics.
This work not only enriches the theoretical system of parameter-efficient
fine-tuning but also provides a new solution for building safer and more
trustworthy alignment mechanisms.
☆ Towards a Measure of Algorithm Similarity
Given two algorithms for the same problem, can we determine whether they are
meaningfully different? In full generality, the question is uncomputable, and
empirically it is muddied by competing notions of similarity. Yet, in many
applications (such as clone detection or program synthesis) a pragmatic and
consistent similarity metric is necessary. We review existing equivalence and
similarity notions and introduce EMOC: an
Evaluation-Memory-Operations-Complexity framework that embeds algorithm
implementations into a feature space suitable for downstream tasks. We compile
PACD, a curated dataset of verified Python implementations across three
problems, and show that EMOC features support clustering and classification of
algorithm types, detection of near-duplicates, and quantification of diversity
in LLM-generated programs. Code, data, and utilities for computing EMOC
embeddings are released to facilitate reproducibility and future work on
algorithm similarity.
comment: 11 pages, many figures and images
♻ ☆ The End of Manual Decoding: Towards Truly End-to-End Language Models
Zhichao Wang, Dongyang Ma, Xinting Huang, Deng Cai, Tian Lan, Jiahao Xu, Haitao Mi, Xiaoying Tang, Yan Wang
The "end-to-end" label for LLMs is a misnomer. In practice, they depend on a
non-differentiable decoding process that requires laborious hand-tuning of
hyperparameters like temperature and top-p. This paper introduces AutoDeco, a
novel architecture that enables truly "end-to-end" generation by learning to
control its own decoding strategy. We augment the standard transformer with
lightweight heads that, at each step, dynamically predict context-specific
temperature and top-p values alongside the next-token logits. This approach
transforms decoding into a parametric, token-level process, allowing the model
to self-regulate its sampling strategy within a single forward pass.
Through extensive experiments on eight benchmarks, we demonstrate that
AutoDeco not only significantly outperforms default decoding strategies but
also achieves performance comparable to an oracle-tuned baseline derived from
"hacking the test set"-a practical upper bound for any static method.
Crucially, we uncover an emergent capability for instruction-based decoding
control: the model learns to interpret natural language commands (e.g.,
"generate with low randomness") and adjusts its predicted temperature and top-p
on a token-by-token basis, opening a new paradigm for steerable and interactive
LLM decoding.
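A minimal PyTorch sketch of per-token decoding control: a small head (the
hypothetical AutoDecoStyleHead below; the paper's exact head design is not
reproduced) maps the last hidden state to a temperature and top-p, which then
drive standard nucleus sampling:

    import torch
    import torch.nn as nn

    class AutoDecoStyleHead(nn.Module):
        # Illustrative control head on top of the transformer's last hidden state.
        def __init__(self, hidden_size):
            super().__init__()
            self.temp_head = nn.Linear(hidden_size, 1)
            self.top_p_head = nn.Linear(hidden_size, 1)

        def forward(self, last_hidden):            # last_hidden: (batch, hidden_size)
            temperature = nn.functional.softplus(self.temp_head(last_hidden)) + 1e-4
            top_p = torch.sigmoid(self.top_p_head(last_hidden))
            return temperature, top_p              # both: (batch, 1)

    def sample_next_token(logits, temperature, top_p):
        # logits: (batch, vocab); temperature, top_p: (batch, 1) from the head above.
        probs = torch.softmax(logits / temperature, dim=-1)
        sorted_probs, sorted_idx = torch.sort(probs, descending=True, dim=-1)
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        keep = (cumulative - sorted_probs) < top_p  # always keeps the top token
        sorted_probs = sorted_probs * keep
        sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
        choice = torch.multinomial(sorted_probs, num_samples=1)
        return sorted_idx.gather(-1, choice)        # (batch, 1) token ids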
♻ ☆ SparsePO: Controlling Preference Alignment of LLMs via Sparse Token Masks EMNLP 2025
Direct alignment algorithms have proven an effective step for aligning
language models to human-desired behaviors. Current variants of the Direct
Preference Optimization objective have focused on a strict setting where all
tokens are contributing signals of KL divergence and rewards to the loss
function. However, human preference is not affected equally by each word in a
sequence but is often dependent on specific words or phrases, e.g., the
existence of toxic terms leads to non-preferred responses. Based on this observation, we
argue that not all tokens should be weighted equally during PO and propose a
flexible objective termed SparsePO, that aims to automatically learn to weight
the KL divergence and reward corresponding to each token during PO training. We
propose two different variants of weight-masks that can either be derived from
the reference model itself or learned on the fly. Notably, our method induces
sparsity in the learned masks, allowing the model to learn how to best balance
reward and KL divergence contributions at the token level, learning an optimal
level of mask sparsity. Extensive experiments illustrate the effectiveness of
our approach at aligning to preference proxies, including sentiment control,
helpfulness and harmlessness, and summary quality. Our method obtains +10% and
+3% win rate points in summarization and dialogue scenarios, respectively,
without compromising model reasoning or the relevancy and faithfulness of the
summary response.
comment: 27 pages, 9 figures, 5 tables. Accepted to EMNLP 2025
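An illustrative PyTorch sketch of a token-masked, DPO-style objective, assuming
per-token log-probs and per-token weight masks are given; how SparsePO derives
or learns the masks is omitted:

    import torch
    import torch.nn.functional as F

    def sparse_po_loss(policy_logps_w, ref_logps_w, mask_w,
                       policy_logps_l, ref_logps_l, mask_l, beta=0.1):
        # *_w / *_l: per-token log-probs of the chosen / rejected response,
        # shape (batch, seq_len); mask_*: per-token weights in [0, 1], same shape.
        rewards_w = (mask_w * (policy_logps_w - ref_logps_w)).sum(dim=-1)
        rewards_l = (mask_l * (policy_logps_l - ref_logps_l)).sum(dim=-1)
        # Bradley-Terry preference loss on the masked reward margin.
        return -F.logsigmoid(beta * (rewards_w - rewards_l)).mean()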
♻ ☆ HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving
Early-Exit Large Language Models (EE-LLMs) enable high throughput inference
by allowing tokens to exit early at intermediate layers. However, their
throughput gains are limited by the computational and memory savings they can
actually realize. Existing EE-LLM frameworks rely on a single model and
therefore their token generation
latencies are bottlenecked by tokens that do not exit early and traverse
additional layers. Moreover, early exits are only known at runtime and depend
on the request. Therefore, these frameworks load the weights of all model
layers even though large portions remain unused when tokens exit early. The
lack of memory savings prevents scaling to larger batch sizes.
We propose $\textit{HELIOS}$, a framework that improves both token generation
latency and batch sizes to enable high-throughput in EE-LLMs. HELIOS exploits
two insights. $\textit{First}$, early exits are often complementary across
models: tokens that do not exit early on one model often take an early exit on
another. HELIOS employs multiple models and dynamically switches between them
to collectively maximize the number of tokens that exit early, and minimize
token generation latencies. $\textit{Second}$, even when a predicted token does
not exit early due to poor confidence, it often remains unchanged even after
additional layer traversal. HELIOS greedily allows such tokens to exit early
and only loads the weights of the layers most likely to be used, yielding
memory savings that are then re-purposed to increase batch sizes. HELIOS
employs real-time profiling to accurately identify the early-exit
distributions, and adaptively switches between models by tracking tokens in
real-time to minimize the performance degradation caused by greedy model
loading and exiting. Our evaluations show that HELIOS achieves $1.48\times$
higher throughput and $15.14\times$ larger batch size compared to existing
EE-LLM frameworks.
♻ ☆ Minitron-SSM: Efficient Hybrid Language Model Compression through Group-Aware SSM Pruning
Ali Taghibakhshi, Sharath Turuvekere Sreenivas, Saurav Muralidharan, Marcin Chochowski, Yashaswi Karnati, Raviraj Joshi, Ameya Sunil Mahabaleshwarkar, Zijia Chen, Yoshi Suhara, Oluwatobi Olabiyi, Daniel Korzekwa, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, Bryan Catanzaro, Ashwath Aithal, Nima Tajbakhsh, Pavlo Molchanov
Hybrid LLM architectures that combine Attention and State Space Models (SSMs)
achieve state-of-the-art accuracy and runtime performance. Recent work has
demonstrated that applying compression and distillation to Attention-only
models yields smaller, more accurate models at a fraction of the training cost.
In this work, we explore the effectiveness of compressing Hybrid architectures.
We introduce a novel group-aware pruning strategy that preserves the structural
integrity of SSM blocks and their sequence modeling capabilities. Furthermore,
we demonstrate the necessity of such SSM pruning to achieve improved accuracy
and inference speed compared to traditional approaches. Our compression recipe
combines SSM, FFN, embedding dimension, and layer pruning, followed by
knowledge distillation-based retraining, similar to the MINITRON technique.
Using this approach, we compress the Nemotron-H 8B Hybrid model down to 4B
parameters with up to 40x fewer training tokens. The resulting model surpasses
the accuracy of similarly-sized models while achieving 2x faster inference,
significantly advancing the Pareto frontier.
♻ ☆ A Multi-Stage Framework with Taxonomy-Guided Reasoning for Occupation Classification Using Large Language Models
Automatically annotating job data with standardized occupations from
taxonomies, known as occupation classification, is crucial for labor market
analysis. However, this task is often hindered by data scarcity and the
challenges of manual annotations. While large language models (LLMs) hold
promise due to their extensive world knowledge and in-context learning
capabilities, their effectiveness depends on their knowledge of occupational
taxonomies, which remains unclear. In this study, we assess the ability of LLMs
to generate precise taxonomic entities from a given taxonomy, highlighting their
limitations, especially for smaller models. To address these challenges, we
propose a multi-stage framework consisting of inference, retrieval, and
reranking stages, which integrates taxonomy-guided reasoning examples to
enhance performance by aligning outputs with taxonomic knowledge. Evaluations
on a large-scale dataset show that our framework not only enhances occupation
and skill classification tasks, but also provides a cost-effective alternative
to frontier models like GPT-4o, significantly reducing computational costs
while maintaining strong performance. This makes it a practical and scalable
solution for occupation classification and related tasks across LLMs.
comment: Accepted to ICWSM'26
♻ ☆ NaviAgent: Bilevel Planning on Tool Navigation Graph for Large-Scale Orchestration
Large language models (LLMs) have recently demonstrated the ability to act as
function call agents by invoking external tools, enabling them to solve tasks
beyond their static knowledge. However, existing agents typically call tools
one step at a time without a global view of task structure. As tools depend
on each other, this leads to error accumulation and limited scalability,
particularly when scaling to thousands of tools. To address these limitations,
we propose NaviAgent, a novel bilevel architecture that decouples task planning
from tool execution through graph-based modeling of the tool ecosystem. At the
task-planning level, the LLM-based agent decides whether to respond directly,
clarify user intent, invoke a toolchain, or execute tool outputs, ensuring
broad coverage of interaction scenarios independent of inter-tool complexity.
At the execution level, a continuously evolving Tool World Navigation Model
(TWNM) encodes structural and behavioral relations among tools, guiding the
agent to generate scalable and robust invocation sequences. By incorporating
feedback from real tool interactions, NaviAgent supports closed-loop
optimization of planning and execution, moving beyond tool calling toward
adaptive navigation of large-scale tool ecosystems. Experiments show that
NaviAgent achieves the best task success rates across models and tasks, and
integrating TWNM further boosts performance by up to 17 points on complex
tasks, underscoring its key role in toolchain orchestration.
♻ ☆ Token Distillation: Attention-aware Input Embeddings For New Tokens
Current language models rely on static vocabularies determined at pretraining
time, which can lead to decreased performance and increased computational cost
for domains underrepresented in the original vocabulary. New tokens can be
added to solve this problem, when coupled with a good initialization for their
new embeddings. However, existing embedding initialization methods require
expensive further training or pretraining of additional modules. In this paper,
we propose Token Distillation and show that by distilling representations
obtained using the original tokenization, we can quickly learn high-quality
input embeddings for new tokens. Experimental results with a wide range of
open-weight models show that Token Distillation outperforms even strong
baselines.
comment: Additional experiments + clearer method name compared to the May 2025
version
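A rough sketch of the distillation idea using the Hugging Face API; the loss,
layer choice, and optimizer here are assumptions (the paper's attention-aware
objective is not reproduced), and distill_new_token_embedding is an
illustrative helper, not released code:

    import torch

    def distill_new_token_embedding(model, tokenizer, new_token, contexts,
                                    steps=100, lr=1e-2):
        # Fit one new input embedding so the model's last hidden state over the new
        # token matches its hidden state over the original multi-token spelling.
        model.eval()
        for p in model.parameters():
            p.requires_grad_(False)
        emb = model.get_input_embeddings()
        new_vec = torch.nn.Parameter(emb.weight.mean(dim=0).clone())  # mean init
        optim = torch.optim.Adam([new_vec], lr=lr)
        for _ in range(steps):
            loss = 0.0
            for ctx in contexts:
                # Teacher: context followed by the original tokenization of the word.
                teacher_ids = tokenizer(ctx + new_token, return_tensors="pt").input_ids
                with torch.no_grad():
                    teacher_h = model(teacher_ids,
                                      output_hidden_states=True).hidden_states[-1][0, -1]
                # Student: same context, with the trainable embedding at the last slot.
                ctx_ids = tokenizer(ctx, return_tensors="pt").input_ids
                student_in = torch.cat([emb(ctx_ids), new_vec.view(1, 1, -1)], dim=1)
                student_h = model(inputs_embeds=student_in,
                                  output_hidden_states=True).hidden_states[-1][0, -1]
                loss = loss + torch.nn.functional.mse_loss(student_h, teacher_h)
            optim.zero_grad()
            loss.backward()
            optim.step()
        return new_vec.detach()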
♻ ☆ Beyond Pixels: Exploring DOM Downsampling for LLM-Based Web Agents
Frontier LLMs only recently enabled serviceable, autonomous web agents. In
this setting, the model serves as an instantaneous domain-model backend: to
suggest an interaction, it is consulted with a web-based task and the
corresponding application state. The key problem lies in serialising that
application state, referred to as a snapshot. State-of-the-art web agents are
premised on grounded GUI snapshots, i.e., screenshots enhanced with visual
cues, not least because they resemble human perception and because images are
a relatively cheap form of model input. LLM vision, however, still lags behind
code interpretation capabilities. DOM snapshots, which structurally resemble
HTML, offer a desirable alternative, but their vast input token size has so
far prevented reliable use in web agents. We propose D2Snap, a
first-of-its-kind DOM downsampling algorithm. Based
on a GPT-4o backend, we evaluate D2Snap on tasks sampled from the
Online-Mind2Web dataset. The success rate of D2Snap-downsampled DOM snapshots
(67%) matches a grounded GUI snapshot baseline (65%) - within the same input
token order of magnitude (1e3). Our best evaluated configurations (one token
order above, but within the model's context window) outperform this baseline
by 8%. Our evaluation, moreover, shows that the DOM's inherent hierarchy is a
strong UI feature for LLMs.
comment: 20 pages, LaTeX; repository URL updated, typos corrected
♻ ☆ Reinforcement Learning vs. Distillation: Understanding Accuracy and Capability in LLM Reasoning
Recent studies have shown that reinforcement learning with verifiable rewards
(RLVR) enhances overall accuracy (pass@1) but often fails to improve capability
(pass@k) of LLMs in reasoning tasks, while distillation can improve both. In
this paper, we investigate the mechanisms behind these phenomena. First, we
demonstrate that RLVR struggles to improve capability as it focuses on
improving the accuracy of the easier questions to the detriment of the accuracy
of the most difficult questions. Second, we show that RLVR does not merely
increase the success probability for the easier questions, but in our small
model settings, produces quality responses that were absent in its original
output distribution. In addition, we show these responses are neither
noticeably longer nor feature more reflection-related keywords, underscoring
the need for more reliable indicators of response quality. Third, from the
experiment distilling teacher responses to in-distribution problems, we find
that capability does not always improve with distillation. We conjecture that
capability improves only when new knowledge is introduced, whereas distilling
reasoning patterns only improves accuracy but not capability, sacrificing
performance on the most difficult questions, similar to RLVR. Together, these
findings offer a clearer understanding of how RLVR and distillation shape
reasoning behavior in LLMs.
comment: 25 pages
♻ ☆ More of the Same: Persistent Representational Harms Under Increased Representation NeurIPS
To recognize and mitigate the harms of generative AI systems, it is crucial
to consider whether and how different societal groups are represented by these
systems. A critical gap emerges when naively measuring or improving who is
represented, as this does not consider how people are represented. In this
work, we develop GAS(P), an evaluation methodology for surfacing
distribution-level group representational biases in generated text, tackling
the setting where groups are unprompted (i.e., groups are not specified in the
input to generative systems). We apply this novel methodology to investigate
gendered representations in occupations across state-of-the-art large language
models. We show that, even though prompting models to generate biographies
yields a gender distribution with a large representation of women,
representational biases persist in how different genders are represented. Our
evaluation methodology reveals that there are statistically significant
distribution-level differences in the word choice used to describe biographies
and personas of different genders across occupations, and we show that many of
these differences are associated with representational harms and stereotypes.
Our empirical findings caution that naively increasing (unprompted)
representation may inadvertently proliferate representational biases, and our
proposed evaluation methodology enables systematic and rigorous measurement of
the problem.
comment: Proceedings of the Neural Information Processing Systems (NeurIPS)
2025; 39 pages, 7 figures, 15 tables
♻ ☆ Ready to Translate, Not to Represent? Bias and Performance Gaps in Multilingual LLMs Across Language Families and Domains
Md. Faiyaz Abdullah Sayeedi, Md. Mahbub Alam, Subhey Sadi Rahman, Md. Adnanul Islam, Jannatul Ferdous Deepti, Tasnim Mohiuddin, Md Mofijul Islam, Swakkhar Shatabda
The rise of Large Language Models (LLMs) has redefined Machine Translation
(MT), enabling context-aware and fluent translations across hundreds of
languages and textual domains. Despite their remarkable capabilities, LLMs
often exhibit uneven performance across language families and specialized
domains. Moreover, recent evidence reveals that these models can encode and
amplify different biases present in their training data, posing serious
concerns for fairness, especially in low-resource languages. To address these
gaps, we introduce Translation Tangles, a unified framework and dataset for
evaluating the translation quality and fairness of open-source LLMs. Our
approach benchmarks 24 bidirectional language pairs across multiple domains
using different metrics. We further propose a hybrid bias detection pipeline
that integrates rule-based heuristics, semantic similarity filtering, and
LLM-based validation. We also introduce a high-quality, bias-annotated dataset
based on human evaluations of 1,439 translation-reference pairs. The code and
dataset are accessible on GitHub:
https://github.com/faiyazabdullah/TranslationTangles
♻ ☆ SMOL: Professionally translated parallel data for 115 under-represented languages
Isaac Caswell, Elizabeth Nielsen, Jiaming Luo, Colin Cherry, Geza Kovacs, Hadar Shemtov, Partha Talukdar, Dinesh Tewari, Baba Mamadi Diane, Djibrila Diane, Solo Farabado Cissé, Koulako Moussa Doumbouya, Edoardo Ferrante, Alessandro Guasoni, Christopher Homan, Mamadou K. Keita, Sudhamoy DebBarma, Ali Kuzhuget, David Anugraha, Muhammad Ravi Shulthan Habibi, Genta Indra Winata, Anthony Munthali, Sina Ahmadi, Andrei Chemyshev, Mingfei Lau, Jonathan Eng
We open-source SMOL (Set of Maximal Overall Leverage), a suite of training
data to unlock machine translation for low-resource languages. SMOL has been
translated into 124 (and growing) under-resourced languages (125 language
pairs), including many for which there exist no previous public resources, for
a total of 6.1M translated tokens. SMOL comprises two sub-datasets, each
carefully chosen for maximum impact given its size: SMOLSENT, a set of
sentences chosen for broad unique token coverage, and SMOLDOC, a document-level
resource focusing on broad topic coverage. They join the already released
GATITOS for a trifecta of paragraph, sentence, and token-level content. We
demonstrate that using SMOL to prompt or fine-tune Large Language Models yields
robust chrF improvements. In addition to translation, we provide factuality
ratings and rationales for all documents in SMOLDOC, yielding the first
factuality datasets for most of these languages.
comment: ~10 pages with appendices
♻ ☆ Curse of Knowledge: When Complex Evaluation Context Benefits yet Biases LLM Judges EMNLP 2025
As large language models (LLMs) grow more capable, they face increasingly
diverse and complex tasks, making reliable evaluation challenging. The paradigm
of LLMs as judges has emerged as a scalable solution, yet prior work primarily
focuses on simple settings. Their reliability in complex tasks--where
multi-faceted rubrics, unstructured reference answers, and nuanced criteria are
critical--remains understudied. In this paper, we constructed ComplexEval, a
challenge benchmark designed to systematically expose and quantify Auxiliary
Information Induced Biases. We systematically investigated and validated 6
previously unexplored biases across 12 basic and 3 advanced scenarios. Key
findings reveal: (1) all evaluated models exhibit significant susceptibility to
these biases, with bias magnitude scaling with task complexity; (2) notably,
Large Reasoning Models (LRMs) show paradoxical vulnerability. Our in-depth
analysis offers crucial insights for improving the accuracy and verifiability
of evaluation signals, paving the way for more general and robust evaluation
models.
comment: EMNLP 2025 Findings
♻ ☆ Mano Technical Report
Tianyu Fu, Anyang Su, Chenxu Zhao, Hanning Wang, Minghui Wu, Zhe Yu, Fei Hu, Mingjia Shi, Wei Dong, Jiayao Wang, Yuyang Chen, Ruiyang Yu, Siran Peng, Menglin Li, Nan Huang, Haitian Wei, Jiawei Yu, Yi Xin, Xilin Zhao, Kai Gu, Ping Jiang, Sifan Zhou, Shuo Wang
Graphical user interfaces (GUIs) are the primary medium for human-computer
interaction, yet automating GUI interactions remains challenging due to the
complexity of visual elements, dynamic environments, and the need for
multi-step reasoning. Existing methods based on vision-language models (VLMs)
often suffer from limited resolution, domain mismatch, and insufficient
sequential decision-making capability. To address these issues, we propose Mano,
a robust GUI agent built upon a multi-modal foundation model pre-trained on
extensive web and computer system data. Our approach integrates a novel
simulated environment for high-fidelity data generation, a three-stage training
pipeline (supervised fine-tuning, offline reinforcement learning, and online
reinforcement learning), and a verification module for error recovery. Mano
demonstrates state-of-the-art performance on multiple GUI benchmarks, including
Mind2Web and OSWorld, achieving significant improvements in success rate and
operational accuracy. Our work provides new insights into the effective
integration of reinforcement learning with VLMs for practical GUI agent
deployment, highlighting the importance of domain-specific data, iterative
training, and holistic reward design.
♻ ☆ Multilingual Political Views of Large Language Models: Identification and Steering
Daniil Gurgurov, Katharina Trinley, Ivan Vykopal, Josef van Genabith, Simon Ostermann, Roberto Zamparelli
Large language models (LLMs) are increasingly used in everyday tools and
applications, raising concerns about their potential influence on political
views. While prior research has shown that LLMs often exhibit measurable
political biases--frequently skewing toward liberal or progressive
positions--key gaps remain. Most existing studies evaluate only a narrow set of
models and languages, leaving open questions about the generalizability of
political biases across architectures, scales, and multilingual settings.
Moreover, few works examine whether these biases can be actively controlled.
In this work, we address these gaps through a large-scale study of political
orientation in modern open-source instruction-tuned LLMs. We evaluate seven
models, including LLaMA-3.1, Qwen-3, and Aya-Expanse, across 14 languages using
the Political Compass Test with 11 semantically equivalent paraphrases per
statement to ensure robust measurement. Our results reveal that larger models
consistently shift toward libertarian-left positions, with significant
variations across languages and model families. To test the manipulability of
political stances, we utilize a simple center-of-mass activation intervention
technique and show that it reliably steers model responses toward alternative
ideological positions across multiple languages. Our code is publicly available
at https://github.com/d-gurgurov/Political-Ideologies-LLMs.
comment: pre-print
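A simplified PyTorch sketch of a center-of-mass activation intervention: take
the mean hidden state over two sets of ideologically contrasting statements,
use their difference as a steering vector, and add it back during the forward
pass. Layer choice, pooling, and scaling are assumptions, not the paper's exact
recipe:

    import torch

    def steering_vector(model, tokenizer, texts_a, texts_b, layer_idx):
        # Center-of-mass difference of hidden states between two statement sets.
        def mean_hidden(texts):
            vecs = []
            for t in texts:
                ids = tokenizer(t, return_tensors="pt").input_ids
                with torch.no_grad():
                    h = model(ids, output_hidden_states=True).hidden_states[layer_idx]
                vecs.append(h.mean(dim=1).squeeze(0))   # average over tokens
            return torch.stack(vecs).mean(dim=0)
        return mean_hidden(texts_a) - mean_hidden(texts_b)

    def add_steering_hook(layer_module, vector, alpha=1.0):
        # Adds alpha * vector to the layer's hidden-state output on every forward pass.
        def hook(module, inputs, output):
            if isinstance(output, tuple):
                return (output[0] + alpha * vector,) + output[1:]
            return output + alpha * vector
        return layer_module.register_forward_hook(hook)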
♻ ☆ Deep Learning-based Prediction of Clinical Trial Enrollment with Uncertainty Estimates
Clinical trials are a systematic endeavor to assess the safety and efficacy
of new drugs or treatments. Conducting such trials typically demands
significant financial investment and meticulous planning, highlighting the need
for accurate predictions of trial outcomes. Accurately predicting patient
enrollment, a key factor in trial success, is one of the primary challenges
during the planning phase. In this work, we propose a novel deep learning-based
method to address this critical challenge. Our method, implemented as a neural
network model, leverages pre-trained language models (PLMs) to capture the
complexities and nuances of clinical documents, transforming them into
expressive representations. These representations are then combined with
encoded tabular features via an attention mechanism. To account for
uncertainties in enrollment prediction, we enhance the model with a
probabilistic layer based on the Gamma distribution, which enables range
estimation. We apply the proposed model to predict clinical trial duration,
assuming site-level enrollment follows a Poisson-Gamma process. We carry out
extensive experiments on real-world clinical trial data, and show that the
proposed method can effectively predict the number of patients enrolled at a
number of sites for a given clinical trial, outperforming established baseline
models.
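An illustrative PyTorch sketch of the probabilistic output layer: a linear head
predicting Gamma parameters, trained by negative log-likelihood, with ranges
read off from samples of the predictive distribution. The feature fusion of
text and tabular inputs is omitted and the head design is an assumption:

    import torch
    import torch.nn as nn

    class GammaEnrollmentHead(nn.Module):
        # Predicts a Gamma(concentration, rate) distribution over enrollment from
        # a fused text+tabular feature vector.
        def __init__(self, in_dim):
            super().__init__()
            self.net = nn.Linear(in_dim, 2)

        def forward(self, features):
            raw = self.net(features)
            concentration = nn.functional.softplus(raw[..., 0]) + 1e-6
            rate = nn.functional.softplus(raw[..., 1]) + 1e-6
            return torch.distributions.Gamma(concentration, rate)

    # Training step: minimise the negative log-likelihood of observed enrollment:
    #   dist = head(features); loss = -dist.log_prob(observed).mean()
    # Range estimates follow from empirical quantiles of samples drawn from dist.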
♻ ☆ LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models
Mixture of experts (MoE) architectures have become a cornerstone for scaling
up and are a key component in most large language models such as GPT-OSS,
DeepSeek-V3, Llama-4, and Gemini-2.5. However, systematic research on MoE
remains severely constrained by the prohibitive computational costs of training
and evaluation, restricting large-scale studies accessible to most researchers.
We introduce LibMoE, a unified framework for reproducible, efficient, and
extensible MoE research that supports both pretraining and sparse-upcycling
regimes. Beyond unified implementations, the framework provides transparent
analytical tools for probing routing and expert dynamics. Leveraging this
foundation, we conduct a comprehensive analysis along three dimensions: (i)
routing dynamics, covering expert selection patterns, routing stability and
optimality, and how routing entropy reveals task specialization and expert
diversity; (ii) the effect of lightweight initialization on load balancing,
demonstrating how subtle changes in router initialization shape early expert
utilization; and (iii) training regime differences, revealing how sparse
upcycling and full pretraining exhibit distinct routing patterns and stability
profiles. By lowering the barrier to entry and standardizing evaluation, along
with our comprehensive analysis, LibMoE broadens access to MoE research and
establishes a reliable benchmark to guide future innovations. Project page:
https://fsoft-aic.github.io/fsoft-LibMoE.github.io.
comment: 15 pages, 9 figures
♻ ☆ Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models NeurIPS 2025
We introduce the Diffusion Chain of Lateral Thought (DCoLT), a reasoning
framework for diffusion language models. DCoLT treats each intermediate step in
the reverse diffusion process as a latent "thinking" action and optimizes the
entire reasoning trajectory to maximize the reward on the correctness of the
final answer with outcome-based Reinforcement Learning (RL). Unlike traditional
Chain-of-Thought (CoT) methods that follow a causal, linear thinking process,
DCoLT allows bidirectional, non-linear reasoning with no strict rule on
grammatical correctness amid its intermediate steps of thought. We implement
DCoLT on two representative Diffusion Language Models (DLMs). First, we choose
SEDD as a representative continuous-time discrete diffusion model, where its
concrete score derives a probabilistic policy to maximize the RL reward over
the entire sequence of intermediate diffusion steps. We further consider the
discrete-time masked diffusion language model -- LLaDA, and find that the order
to predict and unmask tokens plays an essential role to optimize its RL action
resulting from the ranking-based Unmasking Policy Module (UPM) defined by the
Plackett-Luce model. Experiments on both math and code generation tasks show
that using only public data and 16 H800 GPUs, DCoLT-reinforced DLMs outperform
other DLMs trained by SFT or RL or even both. Notably, DCoLT-reinforced LLaDA
boosts its reasoning accuracy by +9.8%, +5.7%, +11.4%, and +19.5% on GSM8K,
MATH, MBPP, and HumanEval, respectively.
comment: Accepted to NeurIPS 2025. Code link:
https://github.com/maple-research-lab/LLaDOU
♻ ☆ FastLongSpeech: Enhancing Large Speech-Language Models for Efficient Long-Speech Processing NeurIPS 2025
The rapid advancement of Large Language Models (LLMs) has spurred significant
progress in Large Speech-Language Models (LSLMs), enhancing their capabilities
in both speech understanding and generation. While existing LSLMs often
concentrate on augmenting speech generation or tackling a diverse array of
short-speech tasks, the efficient processing of long-form speech remains a
critical yet underexplored challenge. This gap is primarily attributed to the
scarcity of long-speech training datasets and the high computational costs
associated with long sequences. To address these limitations, we introduce
FastLongSpeech, a novel framework designed to extend LSLM capabilities for
efficient long-speech processing without necessitating dedicated long-speech
training data. FastLongSpeech incorporates an iterative fusion strategy that
can compress excessively long-speech sequences into manageable lengths. To
adapt LSLMs for long-speech inputs, it introduces a dynamic compression
training approach, which exposes the model to short-speech sequences at varying
compression ratios, thereby transferring the capabilities of LSLMs to
long-speech tasks. To assess the long-speech capabilities of LSLMs, we develop
a long-speech understanding benchmark called LongSpeech-Eval. Experiments show
that our method exhibits strong performance in both long-speech and
short-speech tasks, while greatly improving inference efficiency.
comment: NeurIPS 2025. The code is at
https://github.com/ictnlp/FastLongSpeech. This model is at
https://huggingface.co/ICTNLP/FastLongSpeech. The dataset is at
https://huggingface.co/datasets/ICTNLP/LongSpeech-Eval
♻ ☆ Unlocking Reasoning Capabilities in LLMs via Reinforcement Learning Exploration
Reinforcement learning with verifiable rewards (RLVR) has recently enhanced
the reasoning capabilities of large language models (LLMs), particularly for
mathematical problem solving. However, a fundamental limitation remains: as the
sampling budget increases, the advantage of RLVR-trained models over their
pretrained bases often diminishes or even vanishes, revealing a strong
dependence on the base model's restricted search space. We attribute this
phenomenon to the widespread use of the reverse Kullback-Leibler (KL)
divergence regularizer, whose mode-seeking behavior keeps the policy trapped
inside the base model's support region and hampers wider exploration. To
address this issue, we propose RAPO (Rewards-Aware Policy Optimization), an
algorithm to promote broader yet focused exploration. Our method (i) utilizes
the forward KL penalty to replace the reverse KL penalty for
out-of-distribution exploration, and (ii) reweights the reference policy to
facilitate adaptive in-distribution exploration. We train Qwen2.5-3B and 7B
models with RAPO on the 8K SimpleRL-Zero dataset, without supervised
fine-tuning, and evaluate them on AIME2024 and AIME2025. Results show that RAPO
consistently improves problem-solving performance. Notably, RAPO enables models
to surpass the base model's performance ceiling and solves previously
intractable problems, advancing the frontier of RLVR for challenging reasoning
tasks.
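A minimal PyTorch sketch of the distinction between the two penalties, assuming
per-token log-probabilities of policy-sampled tokens are available; RAPO's
exact estimator and reference-policy reweighting are not reproduced:

    import torch

    def kl_penalties(policy_logps, ref_logps):
        # Per-token log-probs of tokens sampled from the current policy, under the
        # policy and the reference model respectively. Shapes: (batch, seq_len).
        log_ratio = ref_logps - policy_logps              # log(pi_ref / pi)
        # Reverse KL(pi || pi_ref): the usual mode-seeking penalty in RLVR pipelines.
        reverse_kl = -log_ratio
        # Forward KL(pi_ref || pi): importance-weighted estimator on policy samples;
        # it is mass-covering and so tolerates moving outside the reference's modes.
        forward_kl = torch.exp(log_ratio) * log_ratio
        return reverse_kl, forward_kl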
♻ ☆ Mathematics Isn't Culture-Free: Probing Cultural Gaps via Entity and Scenario Perturbations
Although mathematics is often considered culturally neutral, the way
mathematical problems are presented can carry implicit cultural context.
Existing benchmarks like GSM8K are predominantly rooted in Western norms,
including names, currencies, and everyday scenarios. In this work, we create
culturally adapted variants of the GSM8K test set for five regions (Africa,
India, China, Korea, and Japan) using prompt-based transformations followed by
manual verification. We evaluate six large language models (LLMs), ranging from
8B to 72B parameters, across five prompting strategies to assess their
robustness to cultural variation in math problem presentation. Our findings
reveal a consistent performance gap: models perform best on the original
US-centric dataset and comparatively worse on culturally adapted versions.
However, models with reasoning capabilities are more resilient to these shifts,
suggesting that deeper reasoning helps bridge cultural presentation gaps in
mathematical tasks.
♻ ☆ Normative Reasoning in Large Language Models: A Comparative Benchmark from Logical and Modal Perspectives EMNLP 2025
Normative reasoning is a type of reasoning that involves normative or deontic
modality, such as obligation and permission. While large language models (LLMs)
have demonstrated remarkable performance across various reasoning tasks, their
ability to handle normative reasoning remains underexplored. In this paper, we
systematically evaluate LLMs' reasoning capabilities in the normative domain
from both logical and modal perspectives. Specifically, to assess how well LLMs
reason with normative modals, we make a comparison between their reasoning with
normative modals and their reasoning with epistemic modals, which share a
common formal structure. To this end, we introduce a new dataset covering a
wide range of formal patterns of reasoning in both normative and epistemic
domains, while also incorporating non-formal cognitive factors that influence
human reasoning. Our results indicate that, although LLMs generally adhere to
valid reasoning patterns, they exhibit notable inconsistencies in specific
types of normative reasoning and display cognitive biases similar to those
observed in psychological studies of human reasoning. These findings highlight
challenges in achieving logical consistency in LLMs' normative reasoning and
provide insights for enhancing their reliability. All data and code are
released publicly at https://github.com/kmineshima/NeuBAROCO.
comment: Accepted to the 8th BlackboxNLP Workshop at EMNLP 2025
♻ ☆ AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy NeurIPS 2025
Sebastian Antony Joseph, Syed Murtaza Husain, Stella S. R. Offner, Stéphanie Juneau, Paul Torrey, Adam S. Bolton, Juan P. Farias, Niall Gaffney, Greg Durrett, Junyi Jessy Li
Large Language Models (LLMs) are being explored for applications in
scientific research, including their capabilities to synthesize literature,
answer research questions, generate research ideas, and even conduct
computational experiments. Ultimately, our goal is for these to help scientists
derive novel scientific insights. In many areas of science, such insights often
arise from processing and visualizing data to understand its patterns. However,
whether an LLM-mediated scientific workflow produces outputs conveying the
correct scientific insights is challenging to evaluate and has not been
addressed in past work. We introduce AstroVisBench, the first
benchmark for both scientific computing and visualization in the astronomy
domain. AstroVisBench judges a language model's ability to both (1) create
astronomy-specific workflows to process and analyze data and (2) visualize the
results of these workflows through complex plots. Our evaluation of
visualizations uses a novel LLM-as-a-judge workflow, which is validated against
annotation by five professional astronomers. Using AstroVisBench we present an
evaluation of state-of-the-art language models, showing a significant gap in
their ability to engage in astronomy research as useful assistants. This
evaluation provides a strong end-to-end evaluation for AI scientists that
offers a path forward for the development of visualization-based workflows,
which are central to a broad range of domains from physics to biology.
comment: Accepted at NeurIPS 2025 Datasets & Benchmarks Track
♻ ☆ R$^2$ec: Towards Large Recommender Models with Reasoning
Large recommender models have extended LLMs as powerful recommenders via
encoding or item generation, and recent breakthroughs in LLM reasoning
synchronously motivate the exploration of reasoning in recommendation. In this
work, we propose R$^2$ec, a unified large recommender model with intrinsic
reasoning capability. R$^2$ec introduces a dual-head architecture that supports
both reasoning chain generation and efficient item prediction in a single
model, significantly reducing inference latency. To overcome the lack of
annotated reasoning data, we design RecPO, a reinforcement learning framework
that optimizes reasoning and recommendation jointly with a novel fused reward
mechanism. Extensive experiments on three datasets demonstrate that R$^2$ec
outperforms traditional, LLM-based, and reasoning-augmented recommender
baselines, while further analyses validate its competitive efficiency among
conventional LLM-based recommender baselines and strong adaptability to diverse
recommendation scenarios. Code and checkpoints available at
https://github.com/YRYangang/RRec.
comment: Accepted by NeurIPS 2025
♻ ☆ KAT-Coder Technical Report
Zizheng Zhan, Ken Deng, Jinghui Wang, Xiaojiang Zhang, Huaixi Tang, Minglei Zhang, Zhiyi Lai, Haoyang Huang, Wen Xiang, Kun Wu, Wenhao Zhuang, Shaojie Wang, Shangpeng Yan, Kepeng Lei, Zongxian Feng, Huiming Wang, Zheng Lin, Mengtong Li, Mengfei Xie, Yinghan Cui, Xuxing Chen, Chao Wang, Weihao Li, Wenqiang Zhu, Jiarong Zhang, Jingxuan Xu, Songwei Yu, Yifan Yao, Xinping Lei, C. Zhang, Han Li, Junqi Xiong, Zuchen Gao, Dailin Li, Haimo Li, Jiaheng Liu, Yuqun Zhang, Junyi Peng, Haotian Zhang, Bin Chen
Recent advances in large language models (LLMs) have enabled progress in
agentic coding, where models autonomously reason, plan, and act within
interactive software development workflows. However, bridging the gap between
static text-based training and dynamic real-world agentic execution remains a
core challenge. In this technical report, we present KAT-Coder, a large-scale
agentic code model trained through a multi-stage curriculum encompassing
Mid-Term Training, Supervised Fine-Tuning (SFT), Reinforcement Fine-Tuning
(RFT), and Reinforcement-to-Deployment Adaptation. The Mid-Term stage enhances
reasoning, planning, and reflection capabilities through a corpus of real
software engineering data and synthetic agentic interactions. The SFT stage
constructs a million-sample dataset balancing twenty programming languages, ten
development contexts, and ten task archetypes. The RFT stage introduces a novel
multi-ground-truth reward formulation for stable and sample-efficient policy
optimization. Finally, the Reinforcement-to-Deployment phase adapts the model
to production-grade IDE environments using Error-Masked SFT and Tree-Structured
Trajectory Training. In summary, these stages enable KAT-Coder to achieve
robust tool-use reliability, instruction alignment, and long-context reasoning,
forming a deployable foundation for real-world intelligent coding agents. Our
KAT series 32B model, KAT-Dev, has been open-sourced on
https://huggingface.co/Kwaipilot/KAT-Dev.
♻ ☆ Training a Generally Curious Agent ICML 2025
Fahim Tajwar, Yiding Jiang, Abitha Thankaraj, Sumaita Sadia Rahman, J Zico Kolter, Jeff Schneider, Ruslan Salakhutdinov
Efficient exploration is essential for intelligent systems interacting with
their environment, but existing language models often fall short in scenarios
that require strategic information gathering. In this paper, we present
Paprika, a fine-tuning approach that enables language models to develop general
decision-making capabilities that are not confined to particular environments.
By training on synthetic interaction data from different tasks that require
diverse strategies, Paprika teaches models to explore and adapt their behavior
on a new task based on in-context environment feedback, without further gradient
updates. Experimental results show that models fine-tuned with Paprika can
effectively transfer their learned decision-making capabilities to entirely
unseen tasks without additional training. Unlike traditional training, our
approach's primary bottleneck lies in sampling useful interaction data instead
of model updates. To improve sample efficiency, we propose a curriculum
learning strategy that prioritizes sampling trajectories from tasks with high
learning potential. These results suggest a promising path towards AI systems
that can autonomously solve novel sequential decision-making problems that
require interactions with the external world.
comment: ICML 2025. Project Website: https://paprika-llm.github.io
♻ ☆ Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation
Vision-Language Models (VLMs) often struggle to balance visual and textual
information when summarizing complex multimodal inputs, such as entire TV show
episodes. In this paper, we propose a zero-shot video-to-text summarization
approach that builds its own screenplay representation of an episode,
effectively integrating key video moments, dialogue, and character information
into a unified document. Unlike previous approaches, we simultaneously generate
screenplays and name the characters in a zero-shot manner, using only the audio, video,
and transcripts as input. Additionally, we highlight that existing
summarization metrics can fail to assess the multimodal content in summaries.
To address this, we introduce MFactSum, a multimodal metric that evaluates
summaries with respect to both vision and text modalities. Using MFactSum, we
evaluate our screenplay summaries on the SummScreen3D dataset, demonstrating
superiority against state-of-the-art VLMs such as Gemini 1.5 by generating
summaries containing 20% more relevant visual information while requiring 75%
less of the video as input.
♻ ☆ HiRA: A Hierarchical Reasoning Framework for Decoupled Planning and Execution in Deep Search
Complex information needs in real-world search scenarios demand deep
reasoning and knowledge synthesis across diverse sources, which traditional
retrieval-augmented generation (RAG) pipelines struggle to address effectively.
Current reasoning-based approaches suffer from a fundamental limitation: they
use a single model to handle both high-level planning and detailed execution,
leading to inefficient reasoning and limited scalability. In this paper, we
introduce HiRA, a hierarchical framework that separates strategic planning from
specialized execution. Our approach decomposes complex search tasks into
focused subtasks, assigns each subtask to domain-specific agents equipped with
external tools and reasoning capabilities, and coordinates the results through
a structured integration mechanism. This separation prevents execution details
from disrupting high-level reasoning while enabling the system to leverage
specialized expertise for different types of information processing.
Experiments on four complex, cross-modal deep search benchmarks demonstrate
that HiRA significantly outperforms state-of-the-art RAG and agent-based
systems. Our results show improvements in both answer quality and system
efficiency, highlighting the effectiveness of decoupled planning and execution
for multi-step information seeking tasks. Our code is available at
https://github.com/ignorejjj/HiRA.
comment: 9 pages
♻ ☆ E2Rank: Your Text Embedding can Also be an Effective and Efficient Listwise Reranker
Text embedding models serve as a fundamental component in real-world search
applications. By mapping queries and documents into a shared embedding space,
they deliver competitive retrieval performance with high efficiency. However,
their ranking fidelity remains limited compared to dedicated rerankers,
especially recent LLM-based listwise rerankers, which capture fine-grained
query-document and document-document interactions. In this paper, we propose a
simple yet effective unified framework, E2Rank, short for Efficient
Embedding-based Ranking (and also Embedding-to-Rank), which extends a single text embedding
model to perform both high-quality retrieval and listwise reranking through
continued training under a listwise ranking objective, thereby achieving strong
effectiveness with remarkable efficiency. By applying cosine similarity between
the query and document embeddings as a unified ranking function, the listwise
ranking prompt, which is constructed from the original query and its candidate
documents, serves as an enhanced query enriched with signals from the top-K
documents, akin to pseudo-relevance feedback (PRF) in traditional retrieval
models. This design preserves the efficiency and representational quality of
the base embedding model while significantly improving its reranking
performance. Empirically, E2Rank achieves state-of-the-art results on the BEIR
reranking benchmark and demonstrates competitive performance on the
reasoning-intensive BRIGHT benchmark, with very low reranking latency. We also
show that the ranking training process improves embedding performance on the
MTEB benchmark. Our findings indicate that a single embedding model can
effectively unify retrieval and reranking, offering both computational
efficiency and competitive ranking accuracy.
comment: Code and models are available at https://alibaba-nlp.github.io/E2Rank
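The retrieve-then-rerank flow with a single encoder can be sketched as follows,
using a stand-in sentence-transformers encoder and an invented ranking-prompt
template; the released E2Rank model and its listwise training objective differ:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # stand-in

    def cosine(a, b):
        a = a / np.linalg.norm(a, axis=-1, keepdims=True)
        b = b / np.linalg.norm(b, axis=-1, keepdims=True)
        return a @ b.T

    def retrieve_then_rerank(query, documents, k=10):
        doc_emb = encoder.encode(documents)
        q_emb = encoder.encode([query])
        scores = cosine(q_emb, doc_emb)[0]             # stage 1: embedding retrieval
        top = np.argsort(-scores)[:k]
        # Stage 2: build a listwise "ranking prompt" from the query and its top-k
        # documents (akin to pseudo-relevance feedback) and rescore the candidates.
        prompt = query + "\n" + "\n".join(documents[i] for i in top)
        p_emb = encoder.encode([prompt])
        rerank_scores = cosine(p_emb, doc_emb[top])[0]
        return [documents[i] for i in top[np.argsort(-rerank_scores)]]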
♻ ☆ Multilingual State Space Models for Structured Question Answering in Indic Languages NAACL 2025
The diversity and complexity of Indic languages present unique challenges for
natural language processing (NLP) tasks, particularly in the domain of question
answering (QA). To address these challenges, this paper explores the
application of State Space Models (SSMs) to build efficient and contextually aware QA
systems tailored for Indic languages. SSMs are particularly suited for this
task due to their ability to model long-term and short-term dependencies in
sequential data, making them well-equipped to handle the rich morphology,
complex syntax, and contextual intricacies characteristic of Indian languages.
We evaluated multiple SSM architectures across diverse datasets representing
various Indic languages and conducted a comparative analysis of their
performance. Our results demonstrate that these models effectively capture
linguistic subtleties, leading to significant improvements in question
interpretation, context alignment, and answer generation. This work represents
the first application of SSMs to question answering tasks in Indic languages,
establishing a foundational benchmark for future research in this domain. We
propose enhancements to existing SSM frameworks, optimizing their applicability
to low-resource settings and multilingual scenarios prevalent in Indic
languages.
comment: NAACL 2025 Workshop on Technologies for Machine Translation of
Low-Resource Languages (LoResMT)
♻ ☆ DiagramEval: Evaluating LLM-Generated Diagrams via Graphs EMNLP 2025
Diagrams play a central role in research papers for conveying ideas, yet they
are often notoriously complex and labor-intensive to create. Although diagrams
are presented as images, standard image generative models struggle to produce
clear diagrams with well-defined structure. We argue that a promising direction
is to generate demonstration diagrams directly in textual form as SVGs, which
can leverage recent advances in large language models (LLMs). However, due to
the complexity of components and the multimodal nature of diagrams,
sufficiently discriminative and explainable metrics for evaluating the quality
of LLM-generated diagrams remain lacking. In this paper, we propose
DiagramEval, a novel evaluation metric designed to assess demonstration
diagrams generated by LLMs. Specifically, DiagramEval conceptualizes diagrams
as graphs, treating text elements as nodes and their connections as directed
edges, and evaluates diagram quality using two new groups of metrics: node
alignment and path alignment. For the first time, we effectively evaluate
diagrams produced by state-of-the-art LLMs on recent research literature,
quantitatively demonstrating the validity of our metrics. Furthermore, we show
how the enhanced explainability of our proposed metrics offers valuable
insights into the characteristics of LLM-generated diagrams. Code:
https://github.com/ulab-uiuc/diagram-eval.
comment: EMNLP 2025 Main
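A toy sketch of the graph view with networkx, using exact matches as a stand-in
for the alignment computation; DiagramEval's actual node- and path-alignment
metrics are more involved:

    import networkx as nx

    def node_alignment(gen, ref):
        # Fraction of reference text nodes recovered (exact match as a stand-in).
        matched = sum(1 for n in gen.nodes if n in ref.nodes)
        return matched / max(len(ref.nodes), 1)

    def path_alignment(gen, ref):
        # Fraction of reference directed connections recovered.
        ref_edges = set(ref.edges)
        matched = sum(1 for e in gen.edges if e in ref_edges)
        return matched / max(len(ref_edges), 1)

    ref = nx.DiGraph([("Input", "Encoder"), ("Encoder", "Decoder")])
    gen = nx.DiGraph([("Input", "Encoder"), ("Encoder", "Output")])
    node_alignment(gen, ref), path_alignment(gen, ref)   # roughly (0.67, 0.5)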
♻ ☆ FUSE : A Ridge and Random Forest-Based Metric for Evaluating MT in Indigenous Languages NAACL 2025
This paper presents the winning submission of the RaaVa team to the
AmericasNLP 2025 Shared Task 3 on Automatic Evaluation Metrics for Machine
Translation (MT) into Indigenous Languages of America, where our system ranked
first overall based on average Pearson correlation with the human annotations.
We introduce the Feature-Union Scorer (FUSE) for evaluation; FUSE integrates
Ridge regression and Gradient Boosting to model translation quality. In addition to
FUSE, we explore five alternative approaches leveraging different combinations
of linguistic similarity features and learning paradigms. FUSE Score highlights
the effectiveness of combining lexical, phonetic, semantic, and fuzzy token
similarity with learning-based modeling to improve MT evaluation for
morphologically rich and low-resource languages. MT into Indigenous languages
poses unique challenges due to polysynthesis, complex morphology, and
non-standardized orthography. Conventional automatic metrics such as BLEU, TER,
and ChrF often fail to capture deeper aspects like semantic adequacy and
fluency. Our proposed framework, formerly referred to as FUSE, incorporates
multilingual sentence embeddings and phonological encodings to better align
with human evaluation. We train supervised models on human-annotated
development sets and evaluate held-out test data. Results show that FUSE
consistently achieves higher Pearson and Spearman correlations with human
judgments, offering a robust and linguistically informed solution for MT
evaluation in low-resource settings.
comment: NAACL 2025 Workshop on NLP for Indigenous Languages of the Americas
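A small scikit-learn sketch of the feature-union idea: fit Ridge and a
boosted-tree regressor on per-segment similarity features and blend their
predictions. The exact features, model mix (the title also mentions Random
Forest), and blending in FUSE may differ:

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.ensemble import GradientBoostingRegressor

    def fit_fuse(features, human_scores):
        # features: (n_segments, n_similarity_features); human_scores: (n_segments,)
        ridge = Ridge(alpha=1.0).fit(features, human_scores)
        gbr = GradientBoostingRegressor().fit(features, human_scores)
        return ridge, gbr

    def fuse_score(models, features):
        ridge, gbr = models
        # Simple equal-weight blend of the two regressors' quality predictions.
        return 0.5 * ridge.predict(features) + 0.5 * gbr.predict(features)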
♻ ☆ MindSearch: Mimicking Human Minds Elicits Deep AI Searcher ICLR2025
Information seeking and integration is a complex cognitive task that consumes
enormous time and effort. Inspired by the remarkable progress of Large Language
Models, recent works attempt to solve this task by combining LLMs and search
engines. However, these methods still achieve unsatisfying performance due to
three challenges: (1) complex requests often cannot be accurately and
completely resolved by the search engine in a single query; (2) the
information to be integrated is spread over multiple web pages mixed with
massive noise; and (3) a large number of web pages with long contents can
quickly exceed the maximum context length of LLMs. Inspired by the cognitive
process humans follow when solving these problems, we introduce MindSearch to
mimic the human mind in web information seeking and integration, instantiated
as a simple yet effective LLM-based multi-agent framework. The WebPlanner
agent models the human process of multi-step information seeking as dynamic
graph construction: it
decomposes the user query into atomic sub-questions as nodes in the graph and
progressively extends the graph based on the search result from WebSearcher.
Tasked with each sub-question, WebSearcher performs hierarchical information
retrieval with search engines and collects valuable information for WebPlanner.
The multi-agent design of MindSearch enables the whole framework to seek and
integrate information in parallel from a large number of web pages (e.g., more
than 300) in 3 minutes, work that would otherwise take roughly 3 hours of
human effort. MindSearch demonstrates significant improvement in response
quality, in terms of both depth and breadth, on closed-set and open-set QA
problems. Moreover, human evaluators prefer responses from MindSearch based on
InternLM2.5-7B over those from the ChatGPT-Web and Perplexity.ai applications,
which implies that MindSearch can already deliver a competitive alternative to
proprietary AI search engines.
comment: ICLR2025. Project Page: https://mindsearch.netlify.app Code:
https://github.com/InternLM/MindSearch
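The dynamic graph construction described above can be sketched schematically.
In the sketch below, decompose() and search() are placeholder callables
standing in for the LLM-based WebPlanner and WebSearcher agents; their
prompts, the parallel execution, and MindSearch's actual interfaces are not
modeled here.

```python
# Schematic sketch of a WebPlanner-style dynamic query graph. decompose() and
# search() are stand-ins, not MindSearch's API; this only illustrates the
# graph-construction loop (nodes = sub-questions, edges = follow-up relations).
from collections import deque

def build_query_graph(query, decompose, search, max_nodes=16):
    graph = {query: []}          # node -> list of child sub-questions
    answers = {}                 # node -> retrieved evidence
    frontier = deque(decompose(query, context=None))
    for q in frontier:
        graph[query].append(q)
        graph.setdefault(q, [])
    while frontier and len(graph) < max_nodes:
        q = frontier.popleft()
        answers[q] = search(q)   # retrieval for one sub-question
        # Let the planner extend the graph based on what was just found.
        for follow_up in decompose(q, context=answers[q]):
            if follow_up not in graph:
                graph[q].append(follow_up)
                graph[follow_up] = []
                frontier.append(follow_up)
    return graph, answers

# Minimal stand-ins so the sketch runs end to end.
def decompose(q, context=None):
    return [] if context is not None else [f"{q} - definition", f"{q} - recent work"]

def search(q):
    return f"snippets for: {q}"

g, a = build_query_graph("MT evaluation for low-resource languages", decompose, search)
print(g)
```

In the real system the planner keeps proposing follow-up sub-questions based
on the retrieved evidence, so the graph grows until the query is resolved or a
budget is reached, and independent sub-questions can be searched in parallel.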
♻ ☆ Prompt-MII: Meta-Learning Instruction Induction for LLMs
A popular method to adapt large language models (LLMs) to new tasks is
in-context learning (ICL), which is effective but incurs high inference costs
as context length grows. In this paper we propose a method to perform
instruction induction, where we take training examples and reduce them to a
compact but descriptive prompt that can achieve performance comparable to ICL
over the full training set. Specifically, we propose PROMPT-MII, a
reinforcement learning (RL) based framework to meta-learn an instruction
induction model that can generate compact instructions on the fly for an
arbitrary new dataset. We train on over 3,000 diverse classification datasets
from the HuggingFace hub, and evaluate on 90 unseen tasks. PROMPT-MII improves
downstream model quality by 4-9 F1 points (10-20% relative), matching ICL
performance while requiring 3-13x fewer tokens.
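The inference-time use of an induced instruction can be pictured with the
short sketch below. Both induction_model and task_model are placeholder
callables, and the prompt templates are assumptions for illustration; the RL
meta-training of the induction model, which is the paper's contribution, is
not shown.

```python
# Sketch of how an induced instruction replaces full ICL at inference time.
# The prompt templates below are assumptions, not PROMPT-MII's actual format.

def induce_instruction(induction_model, examples):
    """Compress labeled examples into one compact natural-language instruction."""
    shown = "\n".join(f"Input: {x}\nLabel: {y}" for x, y in examples)
    return induction_model(
        "Write a short instruction that tells a model how to label such inputs:\n"
        + shown
    )

def classify(task_model, instruction, new_input):
    # The downstream prompt carries only the induced instruction, not the full
    # demonstration set, which is where the token savings come from.
    return task_model(f"{instruction}\nInput: {new_input}\nLabel:")

# Toy stand-ins so the sketch runs without an actual LLM backend.
induction_model = lambda prompt: "Label the sentiment of the input as positive or negative."
task_model = lambda prompt: "positive" if "love" in prompt else "negative"

examples = [("I love this film", "positive"), ("Terrible pacing", "negative")]
instruction = induce_instruction(induction_model, examples)
print(instruction)
print(classify(task_model, instruction, "I love the soundtrack"))
```

The induced instruction is produced once per dataset and then reused for every
test input, whereas ICL would resend the demonstrations with each query.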
♻ ☆ NeuronMM: High-Performance Matrix Multiplication for LLM Inference on AWS Trainium EuroSys'26
AI accelerators, customized to AI workloads, provide cost-effective and
high-performance solutions for training and inference. Trainium, an AI
accelerator recently developed by Amazon Web Services (AWS), provides an
attractive option for LLM training and inference through its heterogeneous
architecture. However, leveraging Trainium for high performance can be
challenging because of its systolic-array design and special requirements on
data layout. In this paper, we design high-performance matrix
multiplication (matmul), a critical compute kernel, for LLM inference on
Trainium. We introduce a series of techniques customized to Trainium based on
kernel fusion and novel caching strategies to reduce data movement across the
software-managed memory hierarchy, maximize SRAM bandwidth, and avoid expensive
matrix transposes. Evaluating on nine datasets and four recent LLMs, we show
that our system largely outperforms the state-of-the-art matmul implemented by
AWS on Trainium: at the level of matmul kernel, it achieves an average 1.35x
speedup (up to 2.22x), which translates to an average 1.66x speedup (up to
2.49x) for end-to-end LLM inference.
comment: 12 pages, 8 figures, submitted to the Proceedings of the Twenty-First
European Conference on Computer Systems (EuroSys'26)
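Trainium kernels are written against AWS's own kernel programming interfaces,
which are not reproduced here; the NumPy sketch below only illustrates the
general data-movement idea behind such kernels, namely keeping a tile of one
operand resident in fast memory while streaming tiles of the other so that
slow-memory traffic is reduced. The tile size and matrix shapes are arbitrary
choices for the example.

```python
# Generic illustration (not Trainium/NKI code) of kernel-level tiling: a block
# of A is treated as "cached" in fast memory and reused against streamed tiles
# of B, rather than reloading A for every output element. Real accelerator
# kernels would also fuse surrounding operations and pick layouts that avoid
# explicit transposes, as the paper describes.
import numpy as np

def tiled_matmul(A, B, tile=64):
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for k in range(0, K, tile):
            a_blk = A[i:i + tile, k:k + tile]          # reused block of A
            for j in range(0, N, tile):
                b_blk = B[k:k + tile, j:j + tile]      # streamed block of B
                C[i:i + tile, j:j + tile] += a_blk @ b_blk
    return C

A = np.random.rand(128, 256).astype(np.float32)
B = np.random.rand(256, 192).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-3)
```

On a real accelerator the same loop structure would be expressed in the vendor
kernel language, with the fusion and caching strategies the paper introduces
determining how blocks move through the software-managed memory hierarchy.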
♻ ☆ From Memorization to Reasoning in the Spectrum of Loss Curvature
We characterize how memorization is represented in transformer models and
show that it can be disentangled in the weights of both language models (LMs)
and vision transformers (ViTs) using a decomposition based on the loss
landscape curvature. This insight is based on prior theoretical and empirical
work showing that the loss curvature for memorized training points is much
sharper than for non-memorized ones, so ordering weight components from high
to low curvature can reveal this distinction without explicit labels. This
motivates a weight-editing procedure that suppresses recitation of untargeted
memorized data far more effectively than a recent unlearning method
(BalancedSubnet), while maintaining lower perplexity. Since the curvature
basis has a natural interpretation in terms of shared structure in model
weights, we extensively analyze the effect of the editing procedure on
downstream tasks in LMs, and find that fact retrieval and arithmetic are
specifically and consistently degraded, even though open-book fact retrieval
and general logical reasoning are preserved. We posit that these tasks rely
heavily on specialized directions in weight space rather than general-purpose
mechanisms, regardless of whether the individual datapoints involved are
memorized. We support this by showing a correspondence between how strongly
task data activates the low-curvature components we edit out and the drop in
task performance after the edit. Our work enhances the understanding of
memorization in neural
networks with practical applications towards removing it, and provides evidence
for idiosyncratic, narrowly-used structures involved in solving tasks like math
and fact retrieval.
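A highly simplified sketch of curvature-ordered weight editing is given below.
The paper works with a decomposition of loss-landscape curvature over weight
components; here a diagonal proxy (squared gradients on one small batch)
stands in for that curvature, and a fixed fraction of components at one end of
the ranking is zeroed. The model, the data, the edited fraction, and the
choice of which tail to suppress are all assumptions for illustration, not the
paper's procedure.

```python
# Simplified sketch: rank weight components by a curvature proxy and zero out
# one tail of the ranking. Which tail corresponds to memorization is the
# paper's empirical finding; the choice below is only for illustration.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 4)
x, y = torch.randn(32, 16), torch.randint(0, 4, (32,))

# 1. Curvature proxy: squared per-parameter gradients on a small batch
#    (an empirical Fisher diagonal, standing in for true loss curvature).
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
proxy = model.weight.grad.pow(2)

# 2. Rank components by the proxy and select the lowest 10% for editing.
k = int(0.10 * proxy.numel())
threshold = proxy.flatten().kthvalue(k).values
mask = (proxy > threshold).float()

# 3. Edit: zero the selected components while leaving the rest untouched.
with torch.no_grad():
    model.weight.mul_(mask)

print(f"zeroed {(mask == 0).sum().item()} of {mask.numel()} weight components")
```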