Computation and Language 73
☆ Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Reinforcement Learning with Verifiable Rewards (RLVR) has recently
demonstrated notable success in enhancing the reasoning capabilities of LLMs,
particularly in mathematics and programming tasks. It is widely believed that
RLVR enables LLMs to continuously self-improve, thus acquiring novel reasoning
abilities that exceed the corresponding base models' capacity. In this study,
however, we critically re-examine this assumption by measuring the
pass@\textit{k} metric with large values of \textit{k} to explore the reasoning
capability boundary of the models across a wide range of model families and
benchmarks. Surprisingly, RL does \emph{not}, in fact, elicit fundamentally
new reasoning patterns. While RL-trained models outperform their base models at
smaller values of $k$ (e.g., $k$=1), base models can achieve a comparable or
even higher pass@$k$ score compared to their RL counterparts at large $k$
values. The reasoning paths generated by RL-trained models are already included
in the base models' sampling distribution, suggesting that most reasoning
abilities manifested in RL-trained models are already obtained by base models.
Further analysis shows that RL training boosts the performance by biasing the
model's output distribution toward paths that are more likely to yield rewards,
therefore sampling correct responses more efficiently. But this also results in
a narrower reasoning capability boundary compared to base models. Similar
results are observed in visual reasoning tasks trained with RLVR. Moreover, we
find that distillation can genuinely introduce new knowledge into the model,
different from RLVR. These findings underscore a critical limitation of RLVR in
advancing LLM reasoning abilities, which requires us to fundamentally rethink
the impact of RL training on reasoning LLMs and the need for a better paradigm.
Project Page: https://limit-of-RLVR.github.io
comment: 24 pages, 19 figures
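The pass@k values discussed above are typically computed with the standard unbiased estimator (1 - C(n-c, k)/C(n, k)) over n samples per problem, c of which are correct. A minimal sketch of that estimator, shown here as general background rather than the authors' exact evaluation code:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn from
    n generations is correct, given c correct generations."""
    if n - c < k:
        return 1.0
    # Equivalent to 1 - C(n-c, k) / C(n, k), computed as a stable running product
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 256 samples per problem, 3 of them correct
print(pass_at_k(n=256, c=3, k=1))    # ~0.0117
print(pass_at_k(n=256, c=3, k=128))  # much higher at large k
```

At small k the estimate rewards models that sample correct answers efficiently; at large k it approaches the coverage of the sampling distribution, which is the boundary the paper probes.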
☆ MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space
Data quality and diversity are key to the construction of effective
instruction-tuning datasets. With the increasing availability of open-source
instruction-tuning datasets, it is advantageous to automatically select
high-quality and diverse subsets from a vast amount of data. Existing methods
typically prioritize instance quality and use heuristic rules to maintain
diversity. However, this absence of a comprehensive view of the entire
collection often leads to suboptimal results. Moreover, heuristic rules
generally focus on distance or clustering within the embedding space, which
fails to accurately capture the intent of complex instructions in the semantic
space. To bridge this gap, we propose a unified method for quantifying the
information content of datasets. This method models the semantic space by
constructing a label graph and quantifies diversity based on the distribution
of information within the graph. Based on such a measurement, we further
introduce an efficient sampling method that selects data samples iteratively to
\textbf{M}aximize the \textbf{I}nformation \textbf{G}ain (MIG) in semantic
space. Experiments on various datasets and base models demonstrate that MIG
consistently outperforms state-of-the-art methods. Notably, the model
fine-tuned with 5\% Tulu3 data sampled by MIG achieves comparable performance
to the official SFT model trained on the full dataset, with improvements of
+5.73\% on AlpacaEval and +6.89\% on Wildbench.
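A rough sketch of the iterative, gain-maximizing selection the abstract describes. The quality term and the label-coverage entropy used here are simplified stand-ins for MIG's label-graph measure, not the authors' implementation:

```python
import numpy as np

def coverage_entropy(counts: np.ndarray) -> float:
    """Entropy of the label distribution covered by the selected pool (toy proxy)."""
    total = counts.sum()
    if total == 0:
        return 0.0
    p = counts / total
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def greedy_select(quality, label_matrix, budget, alpha=1.0):
    """Greedily pick samples maximizing quality plus the marginal gain in
    label-coverage entropy when the sample's multi-hot labels are added."""
    counts = np.zeros(label_matrix.shape[1])
    selected, remaining = [], set(range(len(quality)))
    for _ in range(budget):
        base = coverage_entropy(counts)
        best_i, best_gain = None, -np.inf
        for i in remaining:
            gain = alpha * quality[i] + coverage_entropy(counts + label_matrix[i]) - base
            if gain > best_gain:
                best_i, best_gain = i, gain
        selected.append(best_i)
        counts += label_matrix[best_i]
        remaining.remove(best_i)
    return selected

quality = np.array([0.9, 0.2, 0.7, 0.4])
labels = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
print(greedy_select(quality, labels, budget=2))  # picks a high-quality, diverse pair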
☆ Science Hierarchography: Hierarchical Organization of Science Literature
Scientific knowledge is growing rapidly, making it challenging to track
progress and high-level conceptual links across broad disciplines. While
existing tools like citation networks and search engines make it easy to access
a few related papers, they fundamentally lack the flexible abstraction needed
to represent the density of activity in various scientific subfields. We
motivate SCIENCE HIERARCHOGRAPHY, the goal of organizing scientific literature
into a high-quality hierarchical structure that allows for the categorization
of scientific work across varying levels of abstraction, from very broad fields
to very specific studies. Such a representation can provide insights into which
fields are well-explored and which are under-explored. To achieve the goals of
SCIENCE HIERARCHOGRAPHY, we develop a range of algorithms. Our primary approach
combines fast embedding-based clustering with LLM-based prompting to balance
the computational efficiency of embedding methods with the semantic precision
offered by LLM prompting. We demonstrate that this approach offers the best
trade-off between quality and speed compared to methods that heavily rely on
LLM prompting, such as iterative tree construction with LLMs. To better reflect
the interdisciplinary and multifaceted nature of research papers, our hierarchy
captures multiple dimensions of categorization beyond simple topic labels. We
evaluate the utility of our framework by assessing how effectively an LLM-based
agent can locate target papers using the hierarchy. Results show that this
structured approach enhances interpretability, supports trend discovery, and
offers an alternative pathway for exploring scientific literature beyond
traditional search methods. Code, data and demo:
https://github.com/JHU-CLSP/science-hierarchography
☆ Generative AI Act II: Test Time Scaling Drives Cognition Engineering
Shijie Xia, Yiwei Qin, Xuefeng Li, Yan Ma, Run-Ze Fan, Steffi Chern, Haoyang Zou, Fan Zhou, Xiangkun Hu, Jiahe Jin, Yanheng He, Yixin Ye, Yixiu Liu, Pengfei Liu
The first generation of Large Language Models - what might be called "Act I"
of generative AI (2020-2023) - achieved remarkable success through massive
parameter and data scaling, yet exhibited fundamental limitations in knowledge
latency, shallow reasoning, and constrained cognitive processes. During this
era, prompt engineering emerged as our primary interface with AI, enabling
dialogue-level communication through natural language. We now witness the
emergence of "Act II" (2024-present), where models are transitioning from
knowledge-retrieval systems (in latent space) to thought-construction engines
through test-time scaling techniques. This new paradigm establishes a
mind-level connection with AI through language-based thoughts. In this paper,
we clarify the conceptual foundations of cognition engineering and explain why
this moment is critical for its development. We systematically break down these
advanced approaches through comprehensive tutorials and optimized
implementations, democratizing access to cognition engineering and enabling
every practitioner to participate in AI's second act. We provide a regularly
updated collection of papers on test-time scaling in the GitHub Repository:
https://github.com/GAIR-NLP/cognition-engineering
☆ Feature Alignment and Representation Transfer in Knowledge Distillation for Large Language Models
Junjie Yang, Junhao Song, Xudong Han, Ziqian Bi, Tianyang Wang, Chia Xin Liang, Xinyuan Song, Yichao Zhang, Qian Niu, Benji Peng, Keyu Chen, Ming Liu
Knowledge distillation (KD) is a technique for transferring knowledge from
complex teacher models to simpler student models, significantly enhancing model
efficiency and accuracy. It has demonstrated substantial advancements in
various applications including image classification, object detection, language
modeling, text classification, and sentiment analysis. Recent innovations in KD
methods, such as attention-based approaches, block-wise logit distillation, and
decoupling distillation, have notably improved student model performance. These
techniques focus on stimulus complexity, attention mechanisms, and global
information capture to optimize knowledge transfer. In addition, KD has proven
effective in compressing large language models while preserving accuracy,
reducing computational overhead, and improving inference speed. This survey
synthesizes the latest literature, highlighting key findings, contributions,
and future directions in knowledge distillation to provide insights for
researchers and practitioners on its evolving role in artificial intelligence
and machine learning.
☆ Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning
Reinforcement learning (RL) has emerged as a powerful paradigm for enhancing
reasoning capabilities in large language models, but faces a fundamental
asymmetry in computation and memory requirements: inference is embarrassingly
parallel with a minimal memory footprint, while policy updates require
extensive synchronization and are memory-intensive. To address this asymmetry,
we introduce PODS (Policy Optimization with Down-Sampling), a framework that
strategically decouples these phases by generating numerous rollouts in
parallel but updating only on an informative subset. Within this framework, we
develop max-variance down-sampling, a theoretically motivated method that
selects rollouts with maximally diverse reward signals. We prove that this
approach has an efficient algorithmic solution, and empirically demonstrate
that GRPO with PODS using max-variance down-sampling achieves superior
performance over standard GRPO on the GSM8K benchmark.
comment: 9 pages, 1 figure
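The abstract states that max-variance down-sampling has an efficient algorithmic solution. One plausible realization, assumed here for illustration rather than taken from the paper, exploits the fact that a variance-maximizing subset can be assembled from the largest and smallest rewards, so only the m+1 extreme splits need to be checked:

```python
import numpy as np

def max_variance_subset(rewards, m):
    """Return indices of m rollouts whose rewards have maximal variance.
    Candidate subsets take the j largest and (m - j) smallest rewards,
    so the search scans only m + 1 splits (sketch under that assumption)."""
    rewards = np.asarray(rewards, dtype=float)
    order = np.argsort(rewards)
    best_idx, best_var = None, -1.0
    for j in range(m + 1):
        idx = np.concatenate([order[: m - j], order[len(order) - j:]])
        var = rewards[idx].var()
        if var > best_var:
            best_idx, best_var = idx, var
    return best_idx

rewards = [0.1, 0.9, 0.0, 1.0, 0.5, 0.4, 0.95, 0.05]
print(max_variance_subset(rewards, m=4))  # indices of a maximally spread subset
```

The selected subset would then be the only rollouts passed to the policy update, keeping inference massively parallel while shrinking the memory-heavy training batch.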
☆ Analyzing LLMs' Knowledge Boundary Cognition Across Languages Through the Lens of Internal Representations
While understanding the knowledge boundaries of LLMs is crucial to prevent
hallucination, research on knowledge boundaries of LLMs has predominantly
focused on English. In this work, we present the first study to analyze how
LLMs recognize knowledge boundaries across different languages by probing their
internal representations when processing known and unknown questions in
multiple languages. Our empirical studies reveal three key findings: 1) LLMs'
perceptions of knowledge boundaries are encoded in the middle to middle-upper
layers across different languages; 2) language differences in knowledge
boundary perception follow a linear structure, which motivates our proposal of
a training-free alignment method that effectively transfers knowledge boundary
perception ability across languages, thereby helping reduce hallucination risk
in low-resource languages; and 3) fine-tuning on bilingual question pair
translation further enhances LLMs' recognition of knowledge boundaries across
languages. Given the absence of standard testbeds for cross-lingual knowledge
boundary analysis, we construct a multilingual evaluation suite comprising
three representative types of knowledge boundary data. Our code and datasets
are publicly available at
https://github.com/DAMO-NLP-SG/LLM-Multilingual-Knowledge-Boundaries.
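A minimal sketch of the kind of layer-wise probing and training-free transfer the study describes: a linear probe separates known from unknown questions in one language's hidden states, and a mean-shift alignment (an illustrative assumption, not necessarily the paper's exact method) lets the probe be reused in another language:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_layer(hidden_states, labels):
    """Fit a linear probe predicting known (1) vs. unknown (0) questions from
    one layer's hidden states; returns the probe and its training accuracy."""
    clf = LogisticRegression(max_iter=1000).fit(hidden_states, labels)
    return clf, clf.score(hidden_states, labels)

def mean_shift_transfer(src_states, tgt_states):
    """Toy training-free alignment: shift target-language states by the
    difference of language means before reusing a source-language probe."""
    return tgt_states + (src_states.mean(axis=0) - tgt_states.mean(axis=0))

# Hypothetical stand-ins for per-layer hidden states of two languages
rng = np.random.default_rng(0)
en_states, en_labels = rng.normal(size=(200, 768)), rng.integers(0, 2, 200)
de_states = rng.normal(loc=0.5, size=(200, 768))

probe, acc = probe_layer(en_states, en_labels)
de_preds = probe.predict(mean_shift_transfer(en_states, de_states))
```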
☆ BadApex: Backdoor Attack Based on Adaptive Optimization Mechanism of Black-box Large Language Models
Previous insertion-based and paraphrase-based backdoors have achieved great
success in attack efficacy, but they ignore the text quality and semantic
consistency between poisoned and clean texts. Although recent studies introduce
LLMs to generate poisoned texts and improve the stealthiness, semantic
consistency, and text quality, their hand-crafted prompts rely on expert
experience and face significant challenges in prompt adaptability and attack
performance after defenses. In this paper, we propose a novel backdoor attack
based on adaptive optimization mechanism of black-box large language models
(BadApex), which leverages a black-box LLM to generate poisoned text through a
refined prompt. Specifically, an Adaptive Optimization Mechanism is designed to
refine an initial prompt iteratively using the generation and modification
agents. The generation agent generates the poisoned text based on the initial
prompt. Then the modification agent evaluates the quality of the poisoned text
and refines a new prompt. After several iterations of the above process, the
refined prompt is used to generate poisoned texts through LLMs. We conduct
extensive experiments on three datasets with six backdoor attacks and two
defenses. The experimental results demonstrate that BadApex significantly
outperforms state-of-the-art attacks. It improves prompt adaptability, semantic
consistency, and text quality. Furthermore, when two defense methods are
applied, the average attack success rate (ASR) still reaches up to 96.75%.
comment: 16 pages, 6 figures
☆ Scaling sparse feature circuit finding for in-context learning
Sparse autoencoders (SAEs) are a popular tool for interpreting large language
model activations, but their utility in addressing open questions in
interpretability remains unclear. In this work, we demonstrate their
effectiveness by using SAEs to deepen our understanding of the mechanism behind
in-context learning (ICL). We identify abstract SAE features that (i) encode
the model's knowledge of which task to execute and (ii) whose latent vectors
causally induce the task zero-shot. This aligns with prior work showing that
ICL is mediated by task vectors. We further demonstrate that these task vectors
are well approximated by a sparse sum of SAE latents, including these
task-execution features. To explore the ICL mechanism, we adapt the sparse
feature circuits methodology of Marks et al. (2024) to work for the much larger
Gemma-1 2B model, with 30 times as many parameters, and to the more complex
task of ICL. Through circuit finding, we discover task-detecting features with
corresponding SAE latents that activate earlier in the prompt and detect when
tasks have been performed. They are causally linked with task-execution
features through the attention and MLP sublayers.
☆ Learning to Attribute with Attention
Given a sequence of tokens generated by a language model, we may want to
identify the preceding tokens that influence the model to generate this
sequence. Performing such token attribution is expensive; a common approach is
to ablate preceding tokens and directly measure their effects. To reduce the
cost of token attribution, we revisit attention weights as a heuristic for how
a language model uses previous tokens. Naive approaches to attribute model
behavior with attention (e.g., averaging attention weights across attention
heads to estimate a token's influence) have been found to be unreliable. To
attain faithful attributions, we propose treating the attention weights of
different attention heads as features. This way, we can learn how to
effectively leverage attention weights for attribution (using signal from
ablations). Our resulting method, Attribution with Attention (AT2), reliably
performs on par with approaches that involve many ablations, while being
significantly more efficient. To showcase the utility of AT2, we use it to
prune less important parts of a provided context in a question answering
setting, improving answer quality. We provide code for AT2 at
https://github.com/MadryLab/AT2 .
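A rough sketch of the core idea of treating per-head attention weights as features: learn a weighting of heads against ablation-derived targets, then attribute with attention alone. This simplified linear variant is assumed for illustration and is not the exact AT2 training setup:

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_attribution_weights(head_attn, ablation_effects):
    """head_attn: (num_pairs, num_heads) attention each head pays from a generated
    span to a candidate source token; ablation_effects: measured effect of
    ablating that token. Learns how to combine heads into an influence score."""
    return Ridge(alpha=1.0).fit(head_attn, ablation_effects)

def attribute(model, head_attn):
    """Predict token influence from per-head attention alone (no new ablations)."""
    return model.predict(head_attn)

# Hypothetical shapes: 500 (source token, output span) pairs, 32 attention heads
rng = np.random.default_rng(1)
X, y = rng.random((500, 32)), rng.random(500)
scores = attribute(fit_attribution_weights(X, y), X[:5])
```

Once the head weighting is fit (using a limited budget of ablations), attribution on new examples costs only a forward pass, which is where the efficiency gain over ablation-heavy methods comes from.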
☆ Controlled Territory and Conflict Tracking (CONTACT): (Geo-)Mapping Occupied Territory from Open Source Intelligence
Open-source intelligence provides a stream of unstructured textual data that
can inform assessments of territorial control. We present CONTACT, a framework
for territorial control prediction using large language models (LLMs) and
minimal supervision. We evaluate two approaches: SetFit, an embedding-based
few-shot classifier, and a prompt tuning method applied to BLOOMZ-560m, a
multilingual generative LLM. Our model is trained on a small hand-labeled
dataset of news articles covering ISIS activity in Syria and Iraq, using
prompt-conditioned extraction of control-relevant signals such as military
operations, casualties, and location references. We show that the BLOOMZ-based
model outperforms the SetFit baseline, and that prompt-based supervision
improves generalization in low-resource settings. CONTACT demonstrates that
LLMs fine-tuned using few-shot methods can reduce annotation burdens and
support structured inference from open-ended OSINT streams. Our code is
available at https://github.com/PaulKMandal/CONTACT/.
comment: 7 pages, 1 figure, 1 table
☆ OpenDeception: Benchmarking and Investigating AI Deceptive Behaviors via Open-ended Interaction Simulation
As the general capabilities of large language models (LLMs) improve and agent
applications become more widespread, the underlying deception risks urgently
require systematic evaluation and effective oversight. Unlike existing
evaluations, which use simulated games or present limited choices, we introduce
OpenDeception, a novel deception evaluation framework with an open-ended
scenario dataset. OpenDeception jointly evaluates both the deception intention
and capabilities of LLM-based agents by inspecting their internal reasoning
process. Specifically, we construct five types of common use cases where LLMs
intensively interact with the user, each consisting of ten diverse, concrete
scenarios from the real world. To avoid ethical concerns and costs of high-risk
deceptive interactions with human testers, we propose to simulate the
multi-turn dialogue via agent simulation. Extensive evaluation of eleven
mainstream LLMs on OpenDeception highlights the urgent need to address
deception risks and security concerns in LLM-based agents: the deception
intention ratio across the models exceeds 80%, while the deception success rate
surpasses 50%. Furthermore, we observe that LLMs with stronger capabilities do
exhibit a higher risk of deception, which calls for more alignment efforts on
inhibiting deceptive behaviors.
☆ Revisiting Uncertainty Quantification Evaluation in Language Models: Spurious Interactions with Response Length Bias Results
Andrea Santilli, Adam Golinski, Michael Kirchhof, Federico Danieli, Arno Blaas, Miao Xiong, Luca Zappella, Sinead Williamson
Uncertainty Quantification (UQ) in Language Models (LMs) is crucial for
improving their safety and reliability. Evaluations often use performance
metrics like AUROC to assess how well UQ methods (e.g., negative sequence
probabilities) correlate with task correctness functions (e.g., ROUGE-L). In
this paper, we show that commonly used correctness functions bias UQ
evaluations by inflating the performance of certain UQ methods. We evaluate 7
correctness functions -- from lexical-based and embedding-based metrics to
LLM-as-a-judge approaches -- across 4 datasets x 4 models x 6 UQ methods. Our
analysis reveals that length biases in the errors of these correctness
functions distort UQ assessments by interacting with length biases in UQ
methods. We identify LLM-as-a-judge approaches as among the least length-biased
choices and hence a potential solution to mitigate these biases.
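The evaluation protocol the paper scrutinizes reduces to: score each response with a UQ method, label it correct or incorrect with a correctness function, and compute AUROC. A minimal sketch with placeholder scores and an assumed 0.5 correctness threshold:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def uq_auroc(uq_scores, correctness, threshold=0.5):
    """AUROC of a confidence score (higher = more confident) for predicting
    correctness, where correctness comes from some correctness function
    (e.g., ROUGE-L or an LLM judge) thresholded into binary labels."""
    labels = (np.asarray(correctness) >= threshold).astype(int)
    return roc_auc_score(labels, uq_scores)

# Hypothetical confidences (e.g., exponentiated sequence log-probabilities)
confidences = [0.91, 0.40, 0.75, 0.15, 0.60]
rouge_l     = [0.82, 0.10, 0.55, 0.05, 0.48]
print(uq_auroc(confidences, rouge_l))
```

The paper's point is that length biases in the correctness labels on the right can interact with length biases in the confidence scores on the left, inflating AUROC for some UQ methods.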
☆ Large Language Models Will Change The Way Children Think About Technology And Impact Every Interaction Paradigm
This paper presents a hopeful perspective on the potentially dramatic impacts
of Large Language Models on how children learn and how they will expect to
interact with technology. We review the effects of LLMs on education so far,
and make the case that these effects are minor compared to the upcoming changes
that are occurring. We present a small scenario and self-ethnographic study
demonstrating the effects of these changes, and define five significant
considerations that interactive systems designers will have to accommodate in
the future.
comment: Accepted for IDC 2025. Citation: Russell Beale. 2025. Large Language
Models Will Change The Way Children Think About Technology And Impact Every
Interaction Paradigm. In Proceedings of Interaction Design and Children
Conference (IDC2025). ACM, New York, NY, USA
☆ Multi-Type Context-Aware Conversational Recommender Systems via Mixture-of-Experts
Conversational recommender systems enable natural language conversations and
thus lead to a more engaging and effective recommendation scenario. As the
conversations for recommender systems usually contain limited contextual
information, many existing conversational recommender systems incorporate
external sources to enrich the contextual information. However, how to combine
different types of contextual information is still a challenge. In this paper,
we propose a multi-type context-aware conversational recommender system, called
MCCRS, effectively fusing multi-type contextual information via
mixture-of-experts to improve conversational recommender systems. MCCRS
incorporates both structured information and unstructured information,
including the structured knowledge graph, unstructured conversation history,
and unstructured item reviews. It consists of several experts, with each expert
specialized in a particular domain (i.e., one specific contextual information).
Multiple experts are then coordinated by a ChairBot to generate the final
results. Our proposed MCCRS model takes advantage of different contextual
information and the specialization of different experts, while the ChairBot
breaks the bottleneck of relying on a single type of contextual information.
Experimental
results demonstrate that our proposed MCCRS method achieves significantly
higher performance compared to existing baselines.
comment: 30 pages
☆ Word Embedding Techniques for Classification of Star Ratings
Telecom services are at the core of today's societies' everyday needs. The
availability of numerous online forums and discussion platforms enables telecom
providers to improve their services by exploring the views of their customers
to learn about common issues that the customers face. Natural Language
Processing (NLP) tools can be used to process the free text collected.
One way of working with such data is to represent text as numerical vectors
using one of many word embedding models based on neural networks. This research
uses a novel dataset of telecom customers' reviews to perform an extensive
study showing how different word embedding algorithms can affect the text
classification process. Several state-of-the-art word embedding techniques are
considered, including BERT, Word2Vec and Doc2Vec, coupled with several
classification algorithms. The important issue of feature engineering and
dimensionality reduction is addressed and several PCA-based approaches are
explored. Moreover, the energy consumption used by the different word
embeddings is investigated. The findings show that some word embedding models
can lead to consistently better text classifiers in terms of precision, recall
and F1-Score. In particular, for the more challenging classification tasks,
BERT combined with PCA stood out with the highest performance metrics.
Moreover, our proposed PCA approach of combining word vectors using the first
principal component shows clear advantages in performance over the traditional
approach of taking the average.
comment: 40 pages
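The proposed PCA-based combination can be sketched as follows: instead of averaging a review's word vectors, use their first principal component as the document representation. A minimal illustration, assuming per-review word embeddings are already available:

```python
import numpy as np
from sklearn.decomposition import PCA

def combine_average(word_vectors: np.ndarray) -> np.ndarray:
    """Traditional baseline: mean of the word vectors in one review."""
    return word_vectors.mean(axis=0)

def combine_first_pc(word_vectors: np.ndarray) -> np.ndarray:
    """Alternative in the spirit of the paper: the first principal component
    of the review's word vectors serves as the document representation."""
    pca = PCA(n_components=1)
    pca.fit(word_vectors)
    return pca.components_[0]

# Hypothetical review of 12 tokens with 300-dimensional embeddings
review = np.random.default_rng(2).normal(size=(12, 300))
doc_avg, doc_pc1 = combine_average(review), combine_first_pc(review)
```

Either vector can then be fed to the downstream star-rating classifier; the study compares such choices across embedding models and classifiers.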
☆ Exploring the Potential for Large Language Models to Demonstrate Rational Probabilistic Beliefs
Advances in the general capabilities of large language models (LLMs) have led
to their use for information retrieval, and as components in automated decision
systems. A faithful representation of probabilistic reasoning in these models
may be essential to ensure trustworthy, explainable and effective performance
in these tasks. Despite previous work suggesting that LLMs can perform complex
reasoning and well-calibrated uncertainty quantification, we find that current
versions of this class of model lack the ability to provide rational and
coherent representations of probabilistic beliefs. To demonstrate this, we
introduce a novel dataset of claims with indeterminate truth values and apply a
number of well-established techniques for uncertainty quantification to measure
the ability of LLM's to adhere to fundamental properties of probabilistic
reasoning.
comment: 8 pages, 4 figures
☆ Simulating Before Planning: Constructing Intrinsic User World Model for User-Tailored Dialogue Policy Planning SIGIR 2025
Recent advancements in dialogue policy planning have emphasized optimizing
system agent policies to achieve predefined goals, focusing on strategy design,
trajectory acquisition, and efficient training paradigms. However, these
approaches often overlook the critical role of user characteristics, which are
essential in real-world scenarios like conversational search and
recommendation, where interactions must adapt to individual user traits such as
personality, preferences, and goals. To address this gap, we first conduct a
comprehensive study utilizing task-specific user personas to systematically
assess dialogue policy planning under diverse user behaviors. By leveraging
realistic user profiles for different tasks, our study reveals significant
limitations in existing approaches, highlighting the need for user-tailored
dialogue policy planning. Building on this foundation, we present the
User-Tailored Dialogue Policy Planning (UDP) framework, which incorporates an
Intrinsic User World Model to model user traits and feedback. UDP operates in
three stages: (1) User Persona Portraying, using a diffusion model to
dynamically infer user profiles; (2) User Feedback Anticipating, leveraging a
Brownian Bridge-inspired anticipator to predict user reactions; and (3)
User-Tailored Policy Planning, integrating these insights to optimize response
strategies. To ensure robust performance, we further propose an active learning
approach that prioritizes challenging user personas during training.
Comprehensive experiments on benchmarks, including collaborative and
non-collaborative settings, demonstrate the effectiveness of UDP in learning
user-specific dialogue strategies. Results validate the protocol's utility and
highlight UDP's robustness, adaptability, and potential to advance user-centric
dialogue systems.
comment: 11 pages, 6 figures, SIGIR 2025
☆ Remedy: Learning Machine Translation Evaluation from Human Preferences with Reward Modeling
A key challenge in MT evaluation is the inherent noise and inconsistency of
human ratings. Regression-based neural metrics struggle with this noise, while
prompting LLMs shows promise at system-level evaluation but performs poorly at
segment level. In this work, we propose ReMedy, a novel MT metric framework
that reformulates translation evaluation as a reward modeling task. Instead of
regressing on imperfect human ratings directly, ReMedy learns relative
translation quality using pairwise preference data, resulting in a more
reliable evaluation. In extensive experiments across WMT22-24 shared tasks (39
language pairs, 111 MT systems), ReMedy achieves state-of-the-art performance
at both segment- and system-level evaluation. Specifically, ReMedy-9B surpasses
larger WMT winners and massive closed LLMs such as MetricX-13B,
XCOMET-Ensemble, GEMBA-GPT-4, PaLM-540B, and finetuned PaLM2. Further analyses
demonstrate that ReMedy delivers superior capability in detecting translation
errors and evaluating low-quality translations.
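Reward modeling over pairwise preferences is usually trained with a Bradley-Terry style objective that pushes the preferred translation's score above the dispreferred one. A minimal PyTorch sketch of that loss, shown as the general recipe rather than ReMedy's exact implementation:

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(score_better: torch.Tensor, score_worse: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry preference loss: maximize the margin between the reward of
    the preferred translation and that of the dispreferred one."""
    return -F.logsigmoid(score_better - score_worse).mean()

# Hypothetical reward-head scores for (source, better) and (source, worse) pairs
better = torch.tensor([1.3, 0.2, 0.8])
worse = torch.tensor([0.9, -0.1, 1.0])
print(pairwise_reward_loss(better, worse))
```

Because only relative quality matters, the objective is insulated from the absolute-rating noise that regression-based metrics have to fit directly.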
☆ Divergent LLM Adoption and Heterogeneous Convergence Paths in Research Writing
Large Language Models (LLMs), such as ChatGPT, are reshaping content creation
and academic writing. This study investigates the impact of AI-assisted
generative revisions on research manuscripts, focusing on heterogeneous
adoption patterns and their influence on writing convergence. Leveraging a
dataset of over 627,000 academic papers from arXiv, we develop a novel
classification framework by fine-tuning prompt- and discipline-specific large
language models to detect the style of ChatGPT-revised texts. Our findings
reveal substantial disparities in LLM adoption across academic disciplines,
gender, native language status, and career stage, alongside a rapid evolution
in scholarly writing styles. Moreover, LLM usage enhances clarity, conciseness,
and adherence to formal writing conventions, with improvements varying by
revision type. Finally, a difference-in-differences analysis shows that while
LLMs drive convergence in academic writing, early adopters, male researchers,
non-native speakers, and junior scholars exhibit the most pronounced stylistic
shifts, aligning their writing more closely with that of established
researchers.
☆ Thought Manipulation: External Thought Can Be Efficient for Large Reasoning Models
Yule Liu, Jingyi Zheng, Zhen Sun, Zifan Peng, Wenhan Dong, Zeyang Sha, Shiwen Cui, Weiqiang Wang, Xinlei He
Recent advancements in large reasoning models (LRMs) have demonstrated the
effectiveness of scaling test-time computation to enhance reasoning
capabilities in multiple tasks. However, LRMs typically suffer from
"overthinking" problems, where models generate significantly redundant
reasoning steps while bringing limited performance gains. Existing work relies
on fine-tuning to mitigate overthinking, which requires additional data and
unconventional training setups and risks safety misalignment and poor
generalization.
Through empirical analysis, we reveal an important characteristic of LRM
behaviors: placing external CoTs generated by smaller models between the
thinking tokens ($\texttt{<think>}$ and $\texttt{</think>}$) can effectively
manipulate the model to generate fewer thoughts. Building on these insights, we
propose a simple yet efficient pipeline, ThoughtMani, to enable LRMs to bypass
unnecessary intermediate steps and reduce computational costs significantly. We
conduct extensive experiments to validate the utility and efficiency of
ThoughtMani. For instance, when applied to QwQ-32B on the LiveBench/Code
dataset, ThoughtMani keeps the original performance and reduces output token
counts by approximately 30%, with little overhead from the CoT generator.
Furthermore, we find that ThoughtMani enhances safety alignment by an average
of 10%. Since model vendors typically serve models of different sizes
simultaneously, ThoughtMani provides an effective way to construct more
efficient and accessible LRMs for real-world applications.
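The core manipulation is simply prompt construction: place a short CoT from a smaller model between the reasoning delimiters before decoding, so the large model skips redundant internal reasoning. A minimal sketch, where the delimiter tokens and formatting are assumptions based on common LRM conventions and should be matched to the target model's chat template:

```python
def build_thought_manipulated_prompt(question: str, external_cot: str) -> str:
    """Insert an externally generated chain of thought between the assumed
    thinking delimiters so the reasoning model continues with a short answer."""
    return (
        f"{question}\n"
        f"<think>\n{external_cot}\n</think>\n"
    )

prompt = build_thought_manipulated_prompt(
    "What is 17 * 24?",
    "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
)
# The prompt is then passed to the LRM, which is expected to emit a brief final answer.
```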
☆ Long-context Non-factoid Question Answering in Indic Languages
Question Answering (QA) tasks, which involve extracting answers from a given
context, are relatively straightforward for modern Large Language Models (LLMs)
when the context is short. However, long contexts pose challenges due to the
quadratic complexity of the self-attention mechanism. This challenge is
compounded in Indic languages, which are often low-resource. This study
explores context-shortening techniques, including Open Information Extraction
(OIE), coreference resolution, Answer Paragraph Selection (APS), and their
combinations, to improve QA performance. Compared to the baseline of
unshortened (long) contexts, our experiments on four Indic languages (Hindi,
Tamil, Telugu, and Urdu) demonstrate that context-shortening techniques yield
an average improvement of 4\% in semantic scores and 47\% in token-level scores
when evaluated on three popular LLMs without fine-tuning. Furthermore, with
fine-tuning, we achieve an average increase of 2\% in both semantic and
token-level scores. Additionally, context-shortening reduces computational
overhead. Explainability techniques like LIME and SHAP reveal that when the APS
model confidently identifies the paragraph containing the answer, nearly all
tokens within the selected text receive high relevance scores. However, the
study also highlights the limitations of LLM-based QA systems in addressing
non-factoid questions, particularly those requiring reasoning or debate.
Moreover, verbalizing OIE-generated triples does not enhance system
performance. These findings emphasize the potential of context-shortening
techniques to improve the efficiency and effectiveness of LLM-based QA systems,
especially for low-resource languages. The source code and resources are
available at https://github.com/ritwikmishra/IndicGenQA.
☆ Continual Pre-Training is (not) What You Need in Domain Adaption
Pin-Er Chen, Da-Chen Lian, Shu-Kai Hsieh, Sieh-Chuen Huang, Hsuan-Lei Shao, Jun-Wei Chiu, Yang-Hsien Lin, Zih-Ching Chen, Cheng-Kuang, Eddie TC Huang, Simon See
The recent advances in Legal Large Language Models (LLMs) have transformed
the landscape of legal research and practice by automating tasks, enhancing
research precision, and supporting complex decision-making processes. However,
effectively adapting LLMs to the legal domain remains challenging due to the
complexity of legal reasoning, the need for precise interpretation of
specialized language, and the potential for hallucinations. This paper examines
the efficacy of Domain-Adaptive Continual Pre-Training (DACP) in improving the
legal reasoning capabilities of LLMs. Through a series of experiments on legal
reasoning tasks within the Taiwanese legal framework, we demonstrate that while
DACP enhances domain-specific knowledge, it does not uniformly improve
performance across all legal tasks. We discuss the trade-offs involved in DACP,
particularly its impact on model generalization and performance in prompt-based
tasks, and propose directions for future research to optimize domain adaptation
strategies in legal AI.
comment: 11 pages, 2 figures
☆ Improving Generalization in Intent Detection: GRPO with Reward-Based Curriculum Sampling
Intent detection, a critical component in task-oriented dialogue (TOD)
systems, faces significant challenges in adapting to the rapid influx of
integrable tools with complex interrelationships. Existing approaches, such as
zero-shot reformulations and LLM-based dynamic recognition, struggle with
performance degradation when encountering unseen intents, leading to erroneous
task routing. To enhance the model's generalization performance on unseen
tasks, we employ Reinforcement Learning (RL) combined with a Reward-based
Curriculum Sampling (RCS) during Group Relative Policy Optimization (GRPO)
training in intent detection tasks. Experiments demonstrate that RL-trained
models substantially outperform supervised fine-tuning (SFT) baselines in
generalization. Besides, the introduction of RCS significantly bolsters
the effectiveness of RL in intent detection by focusing the model on
challenging cases during training. Moreover, incorporating Chain-of-Thought
(COT) processes in RL notably improves generalization in complex intent
detection tasks, underscoring the importance of thought in challenging
scenarios. This work advances the generalization of intent detection tasks,
offering practical insights for deploying adaptable dialogue systems.
☆ DETAM: Defending LLMs Against Jailbreak Attacks via Targeted Attention Modification
With the widespread adoption of Large Language Models (LLMs), jailbreak
attacks have become an increasingly pressing safety concern. While
safety-aligned LLMs can effectively defend against normal harmful queries, they
remain vulnerable to such attacks. Existing defense methods primarily rely on
fine-tuning or input modification, which often suffer from limited
generalization and reduced utility. To address this, we introduce DETAM, a
finetuning-free defense approach that improves LLMs' defensive capabilities
against jailbreak attacks via targeted attention modification.
Specifically, we analyze the differences in attention scores between successful
and unsuccessful defenses to identify the attention heads sensitive to
jailbreak attacks. During inference, we reallocate attention to emphasize the
user's core intention, minimizing interference from attack tokens. Our
experimental results demonstrate that DETAM outperforms various baselines in
jailbreak defense and exhibits robust generalization across different attacks
and models, maintaining its effectiveness even on in-the-wild jailbreak data.
Furthermore, in evaluating the model's utility, we incorporated over-defense
datasets, which further validate the superior performance of our approach. The
code will be released immediately upon acceptance.
☆ Q-FAKER: Query-free Hard Black-box Attack via Controlled Generation NAACL 2025
Many adversarial attack approaches are proposed to verify the vulnerability
of language models. However, they require numerous queries and the information
on the target model. Even black-box attack methods also require the target
model's output information. They are not applicable in real-world scenarios, as
in hard black-box settings where the target model is closed and inaccessible.
Even the recently proposed hard black-box attacks still require many queries
and demand extremely high costs for training adversarial generators. To address
these challenges, we propose Q-faker (Query-free Hard Black-box Attacker), a
novel and efficient method that generates adversarial examples without
accessing the target model. To avoid accessing the target model, we use a
surrogate model instead. The surrogate model generates adversarial sentences
for a target-agnostic attack. During this process, we leverage controlled
generation techniques. We evaluate our proposed method on eight datasets.
Experimental results demonstrate our method's effectiveness, including high
transferability and the high quality of the generated adversarial examples, and
prove its practicality in hard black-box settings.
comment: NAACL 2025 Findings
☆ Enhancing Multilingual Sentiment Analysis with Explainability for Sinhala, English, and Code-Mixed Content
Azmarah Rizvi, Navojith Thamindu, A. M. N. H. Adhikari, W. P. U. Senevirathna, Dharshana Kasthurirathna, Lakmini Abeywardhana
Sentiment analysis is crucial for brand reputation management in the banking
sector, where customer feedback spans English, Sinhala, Singlish, and
code-mixed text. Existing models struggle with low-resource languages like
Sinhala and lack interpretability for practical use. This research develops a
hybrid aspect-based sentiment analysis framework that enhances multilingual
capabilities with explainable outputs. Using cleaned banking customer reviews,
we fine-tune XLM-RoBERTa for Sinhala and code-mixed text, integrate
domain-specific lexicon correction, and employ BERT-base-uncased for English.
The system classifies sentiment (positive, neutral, negative) with confidence
scores, while SHAP and LIME improve interpretability by providing real-time
sentiment explanations. Experimental results show that our approaches
outperform traditional transformer-based classifiers, achieving 92.3 percent
accuracy and an F1-score of 0.89 in English and 88.4 percent in Sinhala and
code-mixed content. An explainability analysis reveals key sentiment drivers,
improving trust and transparency. A user-friendly interface delivers
aspect-wise sentiment insights, ensuring accessibility for businesses. This
research contributes to robust, transparent sentiment analysis for financial
applications by bridging gaps in multilingual, low-resource NLP and
explainability.
comment: 6 pages, 6 figures, 4 tables
☆ CoT-RAG: Integrating Chain of Thought and Retrieval-Augmented Generation to Enhance Reasoning in Large Language Models
Feiyang Li, Peng Fang, Zhan Shi, Arijit Khan, Fang Wang, Dan Feng, Weihao Wang, Xin Zhang, Yongjian Cui
While chain-of-thought (CoT) reasoning improves the performance of large
language models (LLMs) in complex tasks, it still has two main challenges: the
low reliability of relying solely on LLMs to generate reasoning chains and the
interference of natural language reasoning chains on the inference logic of
LLMs. To address these issues, we propose CoT-RAG, a novel reasoning framework
with three key designs: (i) Knowledge Graph-driven CoT Generation, featuring
knowledge graphs to modulate reasoning chain generation of LLMs, thereby
enhancing reasoning credibility; (ii) Learnable Knowledge Case-aware RAG, which
incorporates retrieval-augmented generation (RAG) into knowledge graphs to
retrieve relevant sub-cases and sub-descriptions, providing LLMs with learnable
information; (iii) Pseudo-Program Prompting Execution, which encourages LLMs to
execute reasoning tasks in pseudo-programs with greater logical rigor. We
conduct a comprehensive evaluation on nine public datasets, covering three
reasoning problems. Compared with state-of-the-art methods, CoT-RAG
exhibits a significant accuracy improvement, ranging from 4.0% to 23.0%.
Furthermore, testing on four domain-specific datasets, CoT-RAG shows remarkable
accuracy and efficient execution, highlighting its strong practical
applicability and scalability.
☆ Prejudge-Before-Think: Enhancing Large Language Models at Test-Time by Process Prejudge Reasoning
In this paper, we introduce a new \emph{process prejudge} strategy in LLM
reasoning to demonstrate that bootstrapping with process prejudge allows the
LLM to adaptively anticipate the errors encountered when advancing the
subsequent reasoning steps, similar to people sometimes pausing to think about
what mistakes may occur and how to avoid them, rather than relying solely on
trial and error. Specifically, we define a prejudge node in the rationale as a
reasoning step that is followed by at least one step with no path toward the
correct answer. To synthesize the
prejudge reasoning process, we present an automated reasoning framework with a
dynamic tree-searching strategy. This framework requires only one LLM to
perform answer judging, response critiquing, prejudge generation, and thought
completion. Furthermore, we develop a two-phase training mechanism with
supervised fine-tuning (SFT) and reinforcement learning (RL) to further enhance
the reasoning capabilities of LLMs. Experimental results from competition-level
complex reasoning demonstrate that our method can teach the model to prejudge
before thinking and significantly enhance the reasoning ability of LLMs. Code
and data are released at https://github.com/wjn1996/Prejudge-Before-Think.
☆ Integrating Locality-Aware Attention with Transformers for General Geometry PDEs IJCNN 2025
Neural operators have emerged as promising frameworks for learning mappings
governed by partial differential equations (PDEs), serving as data-driven
alternatives to traditional numerical methods. While methods such as the
Fourier neural operator (FNO) have demonstrated notable performance, their
reliance on uniform grids restricts their applicability to complex geometries
and irregular meshes. Recently, Transformer-based neural operators with linear
attention mechanisms have shown potential in overcoming these limitations for
large-scale PDE simulations. However, these approaches predominantly emphasize
global feature aggregation, often overlooking fine-scale dynamics and localized
PDE behaviors essential for accurate solutions. To address these challenges, we
propose the Locality-Aware Attention Transformer (LA2Former), which leverages
K-nearest neighbors for dynamic patchifying and integrates global-local
attention for enhanced PDE modeling. By combining linear attention for
efficient global context encoding with pairwise attention for capturing
intricate local interactions, LA2Former achieves an optimal balance between
computational efficiency and predictive accuracy. Extensive evaluations across
six benchmark datasets demonstrate that LA2Former improves predictive accuracy
by over 50% relative to existing linear attention methods, while also
outperforming full pairwise attention under optimal conditions. This work
underscores the critical importance of localized feature learning in advancing
Transformer-based neural operators for solving PDEs on complex and irregular
domains.
comment: Accepted by IJCNN 2025
☆ LLM Sensitivity Evaluation Framework for Clinical Diagnosis
Large language models (LLMs) have demonstrated impressive performance across
various domains. However, for clinical diagnosis, higher expectations are
required for LLM's reliability and sensitivity: thinking like physicians and
remaining sensitive to key medical information that affects diagnostic
reasoning, as subtle variations can lead to different diagnosis results. Yet,
existing works focus mainly on investigating the sensitivity of LLMs to
irrelevant context and overlook the importance of key information. In this
paper, we investigate the sensitivity of LLMs, i.e. GPT-3.5, GPT-4, Gemini,
Claude3 and LLaMA2-7b, to key medical information by introducing different
perturbation strategies. The evaluation results highlight the limitations of
current LLMs in remaining sensitive to key medical information for diagnostic
decision-making. The evolution of LLMs must focus on improving their
reliability, enhancing their ability to be sensitive to key information, and
effectively utilizing this information. These improvements will enhance human
trust in LLMs and facilitate their practical application in real-world
scenarios. Our code and dataset are available at
https://github.com/chenwei23333/DiagnosisQA.
☆ CodeVisionary: An Agent-based Framework for Evaluating Large Language Models in Code Generation
Large language models (LLMs) have demonstrated strong capabilities in code
generation, underscoring the critical need for rigorous and comprehensive
evaluation. Existing evaluation approaches fall into three categories,
including human-centered, metric-based, and LLM-based. Considering that
human-centered approaches are labour-intensive and metric-based ones overly
rely on reference answers, LLM-based approaches are gaining increasing
attention due to their stronger contextual understanding capabilities and
superior efficiency. However, the performance of LLM-based approaches remains
limited due to: (1) lack of multisource domain knowledge, and (2) insufficient
comprehension of complex code.
To mitigate the limitations, we propose CodeVisionary, the first LLM-based
agent framework for evaluating LLMs in code generation. CodeVisionary consists
of two stages: (1) Multisource knowledge analysis stage, which aims to gather
multisource and comprehensive domain knowledge by formulating and executing a
stepwise evaluation plan. (2) Negotiation-based scoring stage, which involves
multiple judges engaging in discussions to better comprehend the complex code
and reach a consensus on the evaluation score. Extensive experiments
demonstrate that CodeVisionary achieves the best performance for evaluating
LLMs in code generation, outperforming the best baseline methods with average
improvements of 0.202, 0.139, and 0.117 in Pearson, Spearman, and Kendall-Tau
coefficients, respectively. Besides, CodeVisionary provides detailed evaluation
reports, which assist developers in identifying shortcomings and making
improvements. The resources of CodeVisionary are available at
https://anonymous.4open.science/r/CodeVisionary.
☆ From Large to Super-Tiny: End-to-End Optimization for Cost-Efficient LLMs
Jiliang Ni, Jiachen Pu, Zhongyi Yang, Kun Zhou, Hui Wang, Xiaoliang Xiao, Dakui Wang, Xin Li, Jingfeng Luo, Conggang Hu
In recent years, Large Language Models (LLMs) have significantly advanced
artificial intelligence by optimizing traditional Natural Language Processing
(NLP) pipelines, improving performance and generalization. This has spurred
their integration into various systems. Many NLP systems, including ours,
employ a "one-stage" pipeline directly incorporating LLMs. While effective,
this approach incurs substantial costs and latency due to the need for large
model parameters to achieve satisfactory outcomes. This paper introduces a
three-stage cost-efficient end-to-end LLM deployment pipeline-including
prototyping, knowledge transfer, and model compression-to tackle the
cost-performance dilemma in LLM-based frameworks. Our approach yields a super
tiny model optimized for cost and performance in online systems, simplifying
the system architecture. Initially, by transforming complex tasks into a
function call-based LLM-driven pipeline, an optimal performance prototype
system is constructed to produce high-quality data as a teacher model. The
second stage combines techniques like rejection fine-tuning, reinforcement
learning, and knowledge distillation to transfer knowledge to a smaller 0.5B
student model, delivering effective performance at minimal cost. The final
stage applies quantization and pruning to further compress the model to 0.4B,
achieving ultra-low latency and cost. The framework's modular design and
cross-domain capabilities suggest potential applicability in other NLP areas.
☆ D-GEN: Automatic Distractor Generation and Evaluation for Reliable Assessment of Generative Model
Evaluating generative models with open-ended generation is challenging due to
inconsistencies in response formats. Multiple-choice (MC) evaluation mitigates
this issue, but generating high-quality distractors is time-consuming and
labor-intensive. We introduce D-GEN, the first open-source distractor generator
model that transforms open-ended data into an MC format. To evaluate distractor
quality, we propose two novel methods: (1) ranking alignment, ensuring
generated distractors retain the discriminatory power of ground-truth
distractors, and (2) entropy analysis, comparing model confidence
distributions. Our results show that D-GEN preserves ranking consistency
(Spearman's rho 0.99, Kendall's tau 0.94) and closely matches the entropy
distribution of ground-truth distractors. Human evaluation further confirms the
fluency, coherence, distractiveness, and incorrectness of the generated
distractors. Our work advances
robust and efficient distractor generation with automated evaluation, setting a
new standard for MC evaluation.
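The two proposed checks map onto standard statistics: rank correlation between model orderings under ground-truth vs. generated distractors, and a comparison of answer-choice entropies. A minimal sketch with made-up scores:

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau, entropy

# Hypothetical per-model accuracies under ground-truth vs. generated distractors
acc_ground_truth = [0.71, 0.64, 0.82, 0.58, 0.77]
acc_generated    = [0.69, 0.66, 0.80, 0.55, 0.78]

rho, _ = spearmanr(acc_ground_truth, acc_generated)
tau, _ = kendalltau(acc_ground_truth, acc_generated)
print(f"ranking alignment: Spearman's rho={rho:.2f}, Kendall's tau={tau:.2f}")

# Entropy analysis: compare model confidence distributions over the choices
p_ground_truth = np.array([0.55, 0.20, 0.15, 0.10])
p_generated    = np.array([0.50, 0.22, 0.18, 0.10])
print(entropy(p_ground_truth), entropy(p_generated))
```

High rank correlation means the generated distractors preserve the benchmark's ability to discriminate between models; matched entropies mean they are comparably confusing.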
☆ Secure Multifaceted-RAG for Enterprise: Hybrid Knowledge Retrieval with Security Filtering
Existing Retrieval-Augmented Generation (RAG) systems face challenges in
enterprise settings due to limited retrieval scope and data security risks.
When relevant internal documents are unavailable, the system struggles to
generate accurate and complete responses. Additionally, using closed-source
Large Language Models (LLMs) raises concerns about exposing proprietary
information. To address these issues, we propose the Secure Multifaceted-RAG
(SecMulti-RAG) framework, which retrieves not only from internal documents but
also from two supplementary sources: pre-generated expert knowledge for
anticipated queries and on-demand external LLM-generated knowledge. To mitigate
security risks, we adopt a local open-source generator and selectively utilize
external LLMs only when prompts are deemed safe by a filtering mechanism. This
approach enhances completeness, prevents data leakage, and reduces costs. In
our evaluation on a report generation task in the automotive industry,
SecMulti-RAG significantly outperforms traditional RAG - achieving 79.3 to 91.9
percent win rates across correctness, richness, and helpfulness in LLM-based
evaluation, and 56.3 to 70.4 percent in human evaluation. This highlights
SecMulti-RAG as a practical and secure solution for enterprise RAG.
☆ STAMP Your Content: Proving Dataset Membership via Watermarked Rephrasings ICLR 2025
Given how large parts of publicly available text are crawled to pretrain
large language models (LLMs), data creators increasingly worry about the
inclusion of their proprietary data for model training without attribution or
licensing. Their concerns are also shared by benchmark curators whose test-sets
might be compromised. In this paper, we present STAMP, a framework for
detecting dataset membership, i.e., determining the inclusion of a dataset in
the pretraining corpora of LLMs. Given an original piece of content, our
proposal involves first generating multiple rephrases, each embedding a
watermark with a unique secret key. One version is to be released publicly,
while others are to be kept private. Subsequently, creators can compare model
likelihoods between public and private versions using paired statistical tests
to prove membership. We show that our framework can successfully detect
contamination across four benchmarks which appear only once in the training
data and constitute less than 0.001% of the total tokens, outperforming several
contamination detection and dataset inference baselines. We verify that STAMP
preserves both the semantic meaning and the utility of the original data in
comparing different models. We apply STAMP to two real-world scenarios to
confirm the inclusion of paper abstracts and blog articles in the pretraining
corpora.
comment: Accepted at DATA-FM, WMark @ ICLR 2025. Project page:
https://codeboy5.github.io/stamp
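The membership test amounts to a paired comparison of model likelihoods on the public (released, possibly trained-on) vs. private rephrases of the same documents. A minimal sketch with a one-sided paired t-test on placeholder log-likelihoods; the authors' exact test statistic may differ:

```python
from scipy.stats import ttest_rel

# Hypothetical per-document log-likelihoods under the suspect model
public_ll  = [-42.1, -38.7, -51.0, -45.3, -40.2]   # released, watermarked version
private_ll = [-44.0, -39.5, -52.8, -46.1, -41.9]   # held-back rephrases

# If the public version was in the pretraining data, its likelihood should be higher.
stat, p_value = ttest_rel(public_ll, private_ll, alternative="greater")
print(f"t={stat:.2f}, one-sided p={p_value:.4f}")  # small p -> evidence of membership
```

Because the public and private versions are rephrases of the same content, the pairing controls for document-level difficulty, and any systematic likelihood gap points to memorization of the released version.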
☆ LangCoop: Collaborative Driving with Language
Multi-agent collaboration holds great promise for enhancing the safety,
reliability, and mobility of autonomous driving systems by enabling information
sharing among multiple connected agents. However, existing multi-agent
communication approaches are hindered by limitations of existing communication
media, including high bandwidth demands, agent heterogeneity, and information
loss. To address these challenges, we introduce LangCoop, a new paradigm for
collaborative autonomous driving that leverages natural language as a compact
yet expressive medium for inter-agent communication. LangCoop features two key
innovations: Mixture Model Modular Chain-of-thought (M$^3$CoT) for structured
zero-shot vision-language reasoning and Natural Language Information Packaging
(LangPack) for efficiently packaging information into concise, language-based
messages. Through extensive experiments conducted in the CARLA simulations, we
demonstrate that LangCoop achieves a remarkable 96\% reduction in communication
bandwidth (< 2KB per message) compared to image-based communication, while
maintaining competitive driving performance in the closed-loop evaluation.
☆ A mean teacher algorithm for unlearning of language models
One of the goals of language model unlearning is to reduce memorization of
selected text instances while retaining the model's general abilities. Despite
various proposed methods, reducing memorization of large datasets without
noticeable degradation in model utility remains challenging. In this paper, we
investigate the mean teacher algorithm (Tarvainen & Valpola, 2017), a simple
proximal optimization method from continual learning literature that gradually
modifies the teacher model. We show that the mean teacher can approximate a
trajectory of a slow natural gradient descent (NGD), which inherently seeks
low-curvature updates that are less likely to degrade the model utility. While
slow NGD can suffer from vanishing gradients, we introduce a new unlearning
loss called "negative log-unlikelihood" (NLUL) that avoids this problem. We
show that the combination of mean teacher and NLUL improves some metrics on the
MUSE benchmarks (Shi et al., 2024).
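The mean teacher itself is just an exponential moving average of the student's parameters taken along training (Tarvainen & Valpola, 2017). A minimal PyTorch sketch of that update; the unlearning loss and the natural-gradient connection are omitted:

```python
import torch

@torch.no_grad()
def update_mean_teacher(teacher: torch.nn.Module, student: torch.nn.Module, beta: float = 0.999):
    """EMA update: teacher <- beta * teacher + (1 - beta) * student."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(beta).add_(s_param, alpha=1.0 - beta)

# Hypothetical usage inside an unlearning loop
student = torch.nn.Linear(8, 8)
teacher = torch.nn.Linear(8, 8)
teacher.load_state_dict(student.state_dict())
# ... take an unlearning gradient step on `student`, then:
update_mean_teacher(teacher, student)
```

Because the teacher moves slowly, pulling the student toward it acts as the proximal term that the paper relates to slow natural gradient descent.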
♻ ☆ A-MEM: Agentic Memory for LLM Agents
While large language model (LLM) agents can effectively use external tools
for complex real-world tasks, they require memory systems to leverage
historical experiences. Current memory systems enable basic storage and
retrieval but lack sophisticated memory organization, despite recent attempts
to incorporate graph databases. Moreover, these systems' fixed operations and
structures limit their adaptability across diverse tasks. To address this
limitation, this paper proposes a novel agentic memory system for LLM agents
that can dynamically organize memories in an agentic way. Following the basic
principles of the Zettelkasten method, we designed our memory system to create
interconnected knowledge networks through dynamic indexing and linking. When a
new memory is added, we generate a comprehensive note containing multiple
structured attributes, including contextual descriptions, keywords, and tags.
The system then analyzes historical memories to identify relevant connections,
establishing links where meaningful similarities exist. Additionally, this
process enables memory evolution - as new memories are integrated, they can
trigger updates to the contextual representations and attributes of existing
historical memories, allowing the memory network to continuously refine its
understanding. Our approach combines the structured organization principles of
Zettelkasten with the flexibility of agent-driven decision making, allowing for
more adaptive and context-aware memory management. Empirical experiments on six
foundation models show superior performance over existing SOTA baselines.
The source code for evaluating performance is available at
https://github.com/WujiangXu/AgenticMemory, while the source code of agentic
memory system is available at https://github.com/agiresearch/A-mem.
♻ ☆ From Token to Line: Enhancing Code Generation with a Long-Term Perspective
Tingwei Lu, Yangning Li, Liyuan Wang, Binghuai Lin, Jiwei Tang, Wanshi Xu, Hai-Tao Zheng, Yinghui Li, Bingxu An, Zhao Wei, Yong Xu
The emergence of large language models (LLMs) has significantly promoted the
development of code generation task, sparking a surge in pertinent literature.
Current research is hindered by redundant generation results and a tendency to
overfit local patterns in the short term. Although existing studies attempt to
alleviate the issue by adopting a multi-token prediction strategy, there
remains limited focus on choosing the appropriate processing length for
generations. By analyzing the attention between tokens during the generation
process of LLMs, it can be observed that the high spikes of the attention
scores typically appear at the end of lines. This insight suggests that it is
reasonable to treat each line of code as a fundamental processing unit and
generate them sequentially. Inspired by this, we propose the \textbf{LSR-MCTS}
algorithm, which leverages MCTS to determine the code line-by-line and select
the optimal path. Further, we integrate a self-refine mechanism at each node to
enhance diversity and generate higher-quality programs through error
correction. Extensive experiments and comprehensive analyses on three public
coding benchmarks demonstrate that our method outperforms state-of-the-art
approaches.
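A minimal sketch of the line-as-unit idea, assuming hypothetical sample_line()
and score() helpers (a model call and a value estimate, e.g. from MCTS
rollouts); the actual LSR-MCTS search and its self-refine step are richer than
this greedy stand-in.

    def generate_line_by_line(prompt, sample_line, score, n_candidates=8, max_lines=50):
        # Treat each code line as the processing unit: sample several candidate
        # next lines, keep the best-scoring one, and stop when the sampler
        # returns no continuation.
        program = []
        for _ in range(max_lines):
            candidates = [sample_line(prompt, program) for _ in range(n_candidates)]
            candidates = [c for c in candidates if c]   # drop end-of-program samples
            if not candidates:
                break
            best = max(candidates, key=lambda line: score(prompt, program, line))
            program.append(best)
        return "\n".join(program)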
♻ ☆ GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents
Existing efforts in building Graphical User Interface (GUI) agents largely
rely on the training paradigm of supervised fine-tuning on Large
Vision-Language Models (LVLMs). However, this approach not only demands
extensive amounts of training data but also struggles to effectively understand
GUI screenshots and generalize to unseen interfaces. This issue significantly
limits its application in real-world scenarios, especially for high-level
tasks. Inspired by Reinforcement Fine-Tuning (RFT) in large reasoning models
(e.g., DeepSeek-R1), which efficiently enhances the problem-solving
capabilities of large language models in real-world settings, we propose GUI-R1,
the first reinforcement learning framework designed to enhance the GUI
capabilities of LVLMs in high-level real-world task scenarios, through unified
action space rule modeling. By leveraging a small amount of carefully curated
high-quality data across multiple platforms (including Windows, Linux, MacOS,
Android, and Web) and employing policy optimization algorithms such as Group
Relative Policy Optimization (GRPO) to update the model, GUI-R1 achieves
superior performance using only 0.02\% of the data (3K vs. 13M) compared to
previous state-of-the-art methods like OS-Atlas across eight benchmarks
spanning three different platforms (mobile, desktop, and web). These results
demonstrate the immense potential of reinforcement learning based on unified
action space rule modeling in improving the execution capabilities of LVLMs for
real-world GUI agent tasks.
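For context, the group-relative advantage at the core of GRPO normalizes each
sampled response's rule-based reward within its group, avoiding a learned value
model; the rewards below are made up, and the GUI-specific reward rules are not
shown.

    import numpy as np

    def grpo_advantages(rewards, eps=1e-6):
        # Advantage of each response = (reward - group mean) / group std.
        r = np.asarray(rewards, dtype=float)
        return (r - r.mean()) / (r.std() + eps)

    # e.g. four sampled action sequences for one GUI task, scored 1/0 by a
    # rule-based checker over the unified action space:
    print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))   # roughly [ 1, -1, -1, 1 ]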
♻ ☆ SysCaps: Language Interfaces for Simulation Surrogates of Complex Systems ICLR 2025
Surrogate models are used to predict the behavior of complex energy systems
that are too expensive to simulate with traditional numerical methods. Our work
introduces the use of language descriptions, which we call ``system captions''
or SysCaps, to interface with such surrogates. We argue that interacting with
surrogates through text, particularly natural language, makes these models more
accessible for both experts and non-experts. We introduce a lightweight
multimodal text and timeseries regression model and a training pipeline that
uses large language models (LLMs) to synthesize high-quality captions from
simulation metadata. Our experiments on two real-world simulators of buildings
and wind farms show that our SysCaps-augmented surrogates have better accuracy
on held-out systems than traditional methods while enjoying new generalization
abilities, such as handling semantically related descriptions of the same test
system. Additional experiments also highlight the potential of SysCaps to
unlock language-driven design space exploration and to regularize training
through prompt augmentation.
comment: Accepted at ICLR 2025. 23 pages. Updated with final camera ready
version
♻ ☆ C-MTCSD: A Chinese Multi-Turn Conversational Stance Detection Dataset WWW2025
Stance detection has become an essential tool for analyzing public
discussions on social media. Current methods face significant challenges,
particularly in Chinese language processing and multi-turn conversational
analysis. To address these limitations, we introduce C-MTCSD, the largest
Chinese multi-turn conversational stance detection dataset, comprising 24,264
carefully annotated instances from Sina Weibo, which is 4.2 times larger than
the only prior Chinese conversational stance detection dataset. Our
comprehensive evaluation using both traditional approaches and large language
models reveals the complexity of C-MTCSD: even state-of-the-art models achieve
only 64.07% F1 score in the challenging zero-shot setting, while performance
consistently degrades with increasing conversation depth. Traditional models
particularly struggle with implicit stance detection, achieving below 50% F1
score. This work establishes a challenging new benchmark for Chinese stance
detection research, highlighting significant opportunities for future
improvements.
comment: WWW2025
♻ ☆ Babysit A Language Model From Scratch: Interactive Language Learning by Trials and Demonstrations NAACL 2025
Humans are efficient language learners and inherently social creatures. Our
language development is largely shaped by our social interactions, for example,
the demonstration and feedback from caregivers. Contrary to human language
learning, recent advancements in large language models have primarily adopted a
non-interactive training paradigm, and refined pre-trained models through
feedback afterward. In this work, we explore how corrective feedback from
interactions influences neural language acquisition from scratch through
systematically controlled experiments, assessing whether it contributes to word
learning efficiency in language models. We introduce a trial-and-demonstration
(TnD) learning framework that incorporates three distinct components: student
trials, teacher demonstrations, and a reward conditioned on language competence
at various developmental stages. Our experiments reveal that the TnD approach
accelerates word acquisition for student models of equal and smaller numbers of
parameters, and we highlight the significance of both trials and
demonstrations. We further show that the teacher's choices of words influence
students' word-specific learning efficiency, and a practice-makes-perfect
effect is evident by a strong correlation between the frequency of words in
trials and their respective learning curves. Our findings suggest that
interactive language learning, with teacher demonstrations and active trials,
can facilitate efficient word learning in language models.
comment: NAACL 2025 (Main) & Workshop on Large Language Models and Cognition @
ICML 2024 (Oral)
♻ ☆ Project Alexandria: Towards Freeing Scientific Knowledge from Copyright Burdens via LLMs
Christoph Schuhmann, Gollam Rabby, Ameya Prabhu, Tawsif Ahmed, Andreas Hochlehnert, Huu Nguyen, Nick Akinci, Ludwig Schmidt, Robert Kaczmarczyk, Sören Auer, Jenia Jitsev, Matthias Bethge
Paywalls, licenses and copyright rules often restrict the broad dissemination
and reuse of scientific knowledge. We take the position that it is both legally
and technically feasible to extract the scientific knowledge in scholarly
texts. Current methods, like text embeddings, fail to reliably preserve factual
content, and simple paraphrasing may not be legally sound. We propose a new
idea for the community to adopt: convert scholarly documents into knowledge
preserving, but style agnostic representations we term Knowledge Units using
LLMs. These units use structured data capturing entities, attributes and
relationships without stylistic content. We provide evidence that Knowledge
Units (1) form a legally defensible framework for sharing knowledge from
copyrighted research texts, based on legal analyses of German copyright law and
U.S. Fair Use doctrine, and (2) preserve most (~95\%) factual knowledge from
original text, measured by MCQ performance on facts from the original
copyrighted text across four research domains. Freeing scientific knowledge
from copyright promises transformative benefits for scientific research and
education by allowing language models to reuse important facts from copyrighted
text. To support this, we share open-source tools for converting research
documents into Knowledge Units. Overall, our work posits the feasibility of
democratizing access to scientific knowledge while respecting copyright.
comment: Technical Report
♻ ☆ Only Send What You Need: Learning to Communicate Efficiently in Federated Multilingual Machine Translation
Federated learning (FL) is a promising distributed machine learning paradigm
that enables multiple clients to collaboratively train a global model. In this
paper, we focus on a practical federated multilingual learning setup where
clients with their own language-specific data aim to collaboratively construct
a high-quality neural machine translation (NMT) model. However, communication
constraints in practical network systems present challenges for exchanging
large-scale NMT engines between FL parties. We propose a meta-learning-based
adaptive parameter selection methodology, MetaSend, that improves the
communication efficiency of model transmissions from clients during FL-based
multilingual NMT training. Our approach learns a dynamic threshold for
filtering parameters prior to transmission without compromising the NMT model
quality, based on the tensor deviations of clients between different FL rounds.
Through experiments on two NMT datasets with different language distributions,
we demonstrate that MetaSend obtains substantial improvements over baselines in
translation quality in the presence of a limited communication budget.
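As a hedged sketch of the filtering step (MetaSend learns the threshold via
meta-learning; here it is a fixed placeholder), a client could send only the
tensors that changed enough since the previous FL round:

    import torch

    def select_parameters_to_send(current, previous, threshold):
        # current / previous: dicts mapping parameter names to tensors from the
        # current and previous FL rounds; send only tensors whose deviation
        # exceeds the (learned, here fixed) threshold.
        update = {}
        for name, cur in current.items():
            if torch.norm(cur - previous[name]) > threshold:
                update[name] = cur
        return update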
♻ ☆ Understanding Epistemic Language with a Language-augmented Bayesian Theory of Mind ACL
How do people understand and evaluate claims about others' beliefs, even
though these beliefs cannot be directly observed? In this paper, we introduce a
cognitive model of epistemic language interpretation, grounded in Bayesian
inferences about other agents' goals, beliefs, and intentions: a
language-augmented Bayesian theory-of-mind (LaBToM). By translating natural
language into an epistemic ``language-of-thought'' with grammar-constrained LLM
decoding, then evaluating these translations against the inferences produced by
inverting a generative model of rational action and perception, LaBToM captures
graded plausibility judgments of epistemic claims. We validate our model in an
experiment where participants watch an agent navigate a maze to find keys
hidden in boxes needed to reach their goal, then rate sentences about the
agent's beliefs. In contrast with multimodal LLMs (GPT-4o, Gemini Pro) and
ablated models, our model correlates highly with human judgments for a wide
range of expressions, including modal language, uncertainty expressions,
knowledge claims, likelihood comparisons, and attributions of false belief.
comment: 23 pages; Published at the Transactions of the Association for
Computational Linguistics (TACL); Presented at NAACL 2025
♻ ☆ AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents ICLR 2025
Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, Eric Winsor, Jerome Wynne, Yarin Gal, Xander Davies
The robustness of LLMs to jailbreak attacks, where users design prompts to
circumvent safety measures and misuse model capabilities, has been studied
primarily for LLMs acting as simple chatbots. Meanwhile, LLM agents -- which
use external tools and can execute multi-stage tasks -- may pose a greater risk
if misused, but their robustness remains underexplored. To facilitate research
on LLM agent misuse, we propose a new benchmark called AgentHarm. The benchmark
includes a diverse set of 110 explicitly malicious agent tasks (440 with
augmentations), covering 11 harm categories including fraud, cybercrime, and
harassment. In addition to measuring whether models refuse harmful agentic
requests, scoring well on AgentHarm requires jailbroken agents to maintain
their capabilities following an attack to complete a multi-step task. We
evaluate a range of leading LLMs, and find (1) leading LLMs are surprisingly
compliant with malicious agent requests without jailbreaking, (2) simple
universal jailbreak templates can be adapted to effectively jailbreak agents,
and (3) these jailbreaks enable coherent and malicious multi-step agent
behavior and retain model capabilities. To enable simple and reliable
evaluation of attacks and defenses for LLM-based agents, we publicly release
AgentHarm at https://huggingface.co/datasets/ai-safety-institute/AgentHarm.
comment: Accepted at ICLR 2025
♻ ☆ A Theory of LLM Sampling: Part Descriptive and Part Prescriptive
Large Language Models (LLMs) are increasingly utilized in autonomous
decision-making, where they sample options from vast action spaces. However,
the heuristics that guide this sampling process remain under-explored. We study
this sampling behavior and show that the underlying heuristic resembles that of
human decision-making: it comprises a descriptive component (reflecting the
statistical norm of a concept) and a prescriptive component (an implicit ideal
encoded in the LLM). We show that this deviation of samples from the
statistical norm toward the prescriptive component appears consistently in
concepts across diverse real-world domains such as public health and economic
trends. To further
illustrate the theory, we demonstrate that concept prototypes in LLMs are
affected by prescriptive norms, similar to the concept of normality in humans.
Through case studies and comparison with human studies, we illustrate that in
real-world applications, the shift of samples toward an ideal value in LLMs'
outputs can result in significantly biased decision-making, raising ethical
concerns.
♻ ☆ Can postgraduate translation students identify machine-generated text?
Given the growing use of generative artificial intelligence as a tool for
creating multilingual content and bypassing both machine and traditional
translation methods, this study explores the ability of linguistically trained
individuals to discern machine-generated output from human-written text (HT).
After brief training sessions on the textual anomalies typically found in
synthetic text (ST), twenty-three postgraduate translation students analysed
excerpts of Italian prose and assigned likelihood scores to indicate whether
they believed they were human-written or AI-generated (ChatGPT-4o). The results
show that, on average, the students struggled to distinguish between HT and ST,
with only two participants achieving notable accuracy. Closer analysis revealed
that the students often identified the same textual anomalies in both HT and
ST, although features such as low burstiness and self-contradiction were more
frequently associated with ST. These findings suggest the need for improvements
in the preparatory training. Moreover, the study raises questions about the
necessity of editing synthetic text to make it sound more human-like and
recommends further research to determine whether AI-generated text is already
sufficiently natural-sounding not to require further refinement.
comment: 10 pages, accepted for MT Summit 2025, Geneva, Switzerland, 23-27
June 2025
♻ ☆ The Comparative Trap: Pairwise Comparisons Amplifies Biased Preferences of LLM Evaluators
As large language models (LLMs) are increasingly used as evaluators for
natural language generation tasks, ensuring unbiased assessments is essential.
However, LLM evaluators often display biased preferences, such as favoring
verbosity and authoritative tones. Our empirical analysis reveals that these
biases are exacerbated in pairwise evaluation, where LLMs directly compare two
outputs and easily prioritize superficial attributes. In contrast, pointwise
evaluation, which assesses outputs independently, is less susceptible to such
bias because each output is judged in isolation. To address the limitations of
the pairwise evaluation, we introduce a novel evaluation method, PRePair, which
integrates pointwise reasoning within a pairwise framework. PRePair effectively
alleviates biased preference, improving performance on the adversarial
benchmark (LLMBar) while outperforming pointwise evaluation on the standard
benchmark (MT-Bench).
♻ ☆ Can Tool-augmented Large Language Models be Aware of Incomplete Conditions?
Recent advancements in integrating large language models (LLMs) with tools
have allowed the models to interact with real-world environments. However,
these tool-augmented LLMs often encounter incomplete scenarios when users
provide partial information or the necessary tools are unavailable. Recognizing
and managing such scenarios is crucial for LLMs to ensure their reliability,
but this exploration remains understudied. This study examines whether LLMs can
identify incomplete conditions and appropriately determine when to refrain from
using tools. To this end, we construct a dataset by manipulating instances from
two existing datasets, removing necessary tools or essential information for tool
invocation. Our experiments show that LLMs often struggle to identify the
absence of information required to utilize specific tools and recognize the
absence of appropriate tools. We further analyze model behaviors in different
environments and compare their performance against humans. Our research can
contribute to advancing reliable LLMs by addressing common scenarios during
interactions between humans and LLMs. Our code and dataset will be publicly
available.
♻ ☆ Is In-Context Learning Sufficient for Instruction Following in LLMs? ICLR 2025
In-context learning (ICL) allows LLMs to learn from examples without changing
their weights: this is a particularly promising capability for long-context
LLMs that can potentially learn from many examples. Recently, Lin et al. (2024)
proposed URIAL, a method using only three in-context examples to align base
LLMs, achieving non-trivial instruction following performance. In this work, we
show that, while effective, ICL alignment with URIAL still underperforms
compared to instruction fine-tuning on the established benchmark MT-Bench,
especially with more capable base LLMs. We then uncover the most relevant
elements for successful in-context alignment, finding the crucial role of the
decoding parameters. Based on these insights, we show that the URIAL approach
can indeed be improved by adding high-quality in-context demonstrations,
potentially selected carefully via greedy search, bringing it closer to the
performance of instruct models. Finally, we provide the first, to our
knowledge, systematic comparison of ICL and instruction fine-tuning (IFT) for
instruction following in the low data regime, where ICL can be a viable
alternative to IFT. Overall, our work advances the understanding of ICL as an
alignment technique and its relationship to IFT. We provide our code at
https://github.com/tml-epfl/icl-alignment.
comment: Accepted at ICLR 2025. This camera-ready version v3 adds multi-turn
alignment via ICL, revisiting main results on instruct models, and simple
mechanistic study. Updates in the v2: experiment with decoding schemes,
scaling in-context alignment, ICL vs IFT for instruction following. Code at
https://github.com/tml-epfl/icl-alignment
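One plausible reading of "selected carefully via greedy search" (my phrasing,
not the paper's code) is a standard greedy subset selection over a candidate
pool, where score_prompt() is a placeholder for an MT-Bench-style evaluation of
the resulting in-context prompt:

    def greedy_select_demos(pool, score_prompt, k=3):
        # At each step, add the candidate demonstration that maximizes the
        # validation score of the prompt built from the demos chosen so far.
        chosen = []
        for _ in range(k):
            remaining = [d for d in pool if d not in chosen]
            if not remaining:
                break
            best = max(remaining, key=lambda d: score_prompt(chosen + [d]))
            chosen.append(best)
        return chosen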
♻ ☆ The Mirage of Performance Gains: Why Contrastive Decoding Fails to Address Multimodal Hallucination
Contrastive decoding strategies are widely used to reduce hallucinations in
multimodal large language models (MLLMs). These methods work by constructing
contrastive samples to induce hallucinations and then suppressing them in the
output distribution. However, this paper demonstrates that such approaches fail
to effectively mitigate the hallucination problem. The performance improvements
observed on the POPE benchmark are largely driven by two misleading factors: (1)
crude, unidirectional adjustments to the model's output distribution and (2)
the adaptive plausibility constraint, which reduces the sampling strategy to
greedy search. To further illustrate these issues, we introduce a series of
spurious improvement methods and evaluate their performance against contrastive
decoding techniques. Experimental results reveal that the observed performance
gains in contrastive decoding are entirely unrelated to its intended goal of
mitigating hallucinations. Our findings challenge common assumptions about the
effectiveness of contrastive decoding strategies and pave the way for
developing genuinely effective solutions to hallucinations in MLLMs.
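For readers unfamiliar with the critiqued setup, a typical contrastive-decoding
step (a generic sketch, not the exact formulation of any one method) amplifies
the gap between the original and the hallucination-inducing pass and then
applies the adaptive plausibility constraint; with a strict cutoff, few tokens
survive the mask and sampling effectively degenerates toward greedy search,
which is the second misleading factor discussed above.

    import torch

    def contrastive_step(logits_orig, logits_ctr, alpha=1.0, beta=0.1):
        # Amplify the difference between the original and contrastive logits,
        # then mask tokens whose original probability falls below beta times
        # the maximum probability (adaptive plausibility constraint).
        probs_orig = torch.softmax(logits_orig, dim=-1)
        adjusted = (1 + alpha) * logits_orig - alpha * logits_ctr
        adjusted = adjusted.masked_fill(probs_orig < beta * probs_orig.max(), float("-inf"))
        return torch.softmax(adjusted, dim=-1)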
♻ ☆ Argumentative Large Language Models for Explainable and Contestable Claim Verification AAAI 2025
The profusion of knowledge encoded in large language models (LLMs) and their
ability to apply this knowledge zero-shot in a range of settings makes them
promising candidates for use in decision-making. However, they are currently
limited by their inability to provide outputs which can be faithfully explained
and effectively contested to correct mistakes. In this paper, we attempt to
reconcile these strengths and weaknesses by introducing \emph{argumentative
LLMs (ArgLLMs)}, a method for augmenting LLMs with argumentative reasoning.
Concretely, ArgLLMs construct argumentation frameworks, which then serve as the
basis for formal reasoning in support of decision-making. The interpretable
nature of these argumentation frameworks and formal reasoning means that any
decision made by ArgLLMs may be explained and contested. We evaluate ArgLLMs'
performance experimentally in comparison with state-of-the-art techniques, in
the context of the decision-making task of claim verification. We also define
novel properties to characterise contestability and assess ArgLLMs formally in
terms of these properties.
comment: 18 pages, 18 figures. Accepted as an oral presentation at AAAI 2025
♻ ☆ Prompt-Based Cost-Effective Evaluation and Operation of ChatGPT as a Computer Programming Teaching Assistant
The dream of achieving a student-teacher ratio of 1:1 is closer than ever
thanks to the emergence of large language models (LLMs). One potential
application of these models in the educational field would be to provide
feedback to students in university introductory programming courses, so that a
student struggling to solve a basic implementation problem could seek help from
an LLM available 24/7. This article focuses on studying three aspects related
to such an application. First, the performance of two well-known models,
GPT-3.5T and GPT-4T, in providing feedback to students is evaluated. The
empirical results showed that GPT-4T performs much better than GPT-3.5T;
however, it is not yet ready for use in a real-world scenario. This is due to
the possibility of generating incorrect information that potential users may
not always be able to detect. Second, the article proposes a carefully designed
prompt using in-context learning techniques that allows automating important
parts of the evaluation process, as well as providing a lower bound for the
fraction of feedback messages containing incorrect information, saving time and
effort.
This was possible because the resulting feedback has a programmatically
analyzable structure that incorporates diagnostic information about the LLM's
performance in solving the requested task. Third, the article also suggests a
possible strategy for implementing a practical learning tool based on LLMs,
which is rooted in the proposed prompting techniques. This strategy opens up a
whole range of interesting possibilities from a pedagogical perspective.
♻ ☆ Can LLMs assist with Ambiguity? A Quantitative Evaluation of various Large Language Models on Word Sense Disambiguation
Ambiguous words are often found in modern digital communications. Lexical
ambiguity challenges traditional Word Sense Disambiguation (WSD) methods, due
to limited data. Consequently, the efficiency of translation, information
retrieval, and question-answering systems is hindered by these limitations.
This study investigates the use of Large Language Models (LLMs) to improve WSD
using a novel approach combining a systematic prompt augmentation mechanism
with a knowledge base (KB) consisting of different sense interpretations. The
proposed method incorporates a human-in-the-loop approach for prompt
augmentation, where the prompt is supported by Part-of-Speech (POS) tagging,
synonyms of ambiguous words, aspect-based sense filtering, and few-shot
prompting to guide
the LLM. By utilizing a few-shot Chain of Thought (COT) prompting-based
approach, this work demonstrates a substantial improvement in performance. The
evaluation was conducted using FEWS test data and sense tags. This research
advances accurate word interpretation in social media and digital
communication.
comment: 12 pages,6 tables, 1 figure, Proceedings of the 1st International
Conference on NLP & AI for Cyber Security
♻ ☆ Token-Level Density-Based Uncertainty Quantification Methods for Eliciting Truthfulness of Large Language Models
Artem Vazhentsev, Lyudmila Rvanova, Ivan Lazichny, Alexander Panchenko, Maxim Panov, Timothy Baldwin, Artem Shelmanov
Uncertainty quantification (UQ) is a prominent approach for eliciting
truthful answers from large language models (LLMs). To date, information-based
and consistency-based UQ have been the dominant UQ methods for text generation
via LLMs. Density-based methods, despite being very effective for UQ in text
classification with encoder-based models, have not been very successful with
generative LLMs. In this work, we adapt Mahalanobis Distance (MD) - a
well-established UQ technique in classification tasks - for text generation and
introduce a new supervised UQ method. Our method extracts token embeddings from
multiple layers of LLMs, computes MD scores for each token, and uses linear
regression trained on these features to provide robust uncertainty scores.
Through extensive experiments on eleven datasets, we demonstrate that our
approach substantially improves over existing UQ methods, providing accurate
and computationally efficient uncertainty scores for both sequence-level
selective generation and claim-level fact-checking tasks. Our method also
exhibits strong generalization to out-of-domain data, making it suitable for a
wide range of LLM-based applications.
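A compressed sketch of the density-based recipe described above, with
illustrative shapes: fit a Gaussian to training token embeddings of a given
layer, score each generated token by its Mahalanobis distance, aggregate per
layer, and train a simple regressor on top (the paper's exact feature set and
regression target may differ).

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def fit_gaussian(train_embs):
        # train_embs: (N, D) token embeddings from one layer of the LLM.
        mean = train_embs.mean(axis=0)
        cov_inv = np.linalg.pinv(np.cov(train_embs, rowvar=False))
        return mean, cov_inv

    def mahalanobis_scores(embs, mean, cov_inv):
        # Per-token squared Mahalanobis distance; embs: (T, D).
        diff = embs - mean
        return np.einsum("td,dk,tk->t", diff, cov_inv, diff)

    def fit_uq_head(X, y):
        # X: (n_samples, n_layers) aggregated MD features per generation,
        # y: supervision signal (e.g. correctness of the generation).
        return LinearRegression().fit(X, y)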
♻ ☆ Finding Flawed Fictions: Evaluating Complex Reasoning in Language Models via Plot Hole Detection
Stories are a fundamental aspect of human experience. Engaging deeply with
stories and spotting plot holes -- inconsistencies in a storyline that break
the internal logic or rules of a story's world -- requires nuanced reasoning
skills, including tracking entities and events and their interplay, abstract
thinking, pragmatic narrative understanding, commonsense and social reasoning,
and theory of mind. As Large Language Models (LLMs) increasingly generate,
interpret, and modify text, rigorously assessing their narrative consistency
and deeper language understanding becomes critical. However, existing
benchmarks focus mainly on surface-level comprehension. In this work, we
propose plot hole detection in stories as a proxy to evaluate language
understanding and reasoning in LLMs. We introduce FlawedFictionsMaker, a novel
algorithm to controllably and carefully synthesize plot holes in human-written
stories. Using this algorithm, we construct FlawedFictions, a benchmark for
evaluating LLMs' plot hole detection abilities in stories that is robust to
contamination, with human filtering ensuring high quality. We find that
state-of-the-art LLMs struggle in accurately solving FlawedFictions regardless
of the reasoning effort allowed, with performance significantly degrading as
story length increases. Finally, we show that LLM-based story summarization and
story generation are prone to introducing plot holes, with more than 50% and
100% increases in plot hole detection rates with respect to human-written
originals.
comment: Preprint
♻ ☆ EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting
Guanrou Yang, Chen Yang, Qian Chen, Ziyang Ma, Wenxi Chen, Wen Wang, Tianrui Wang, Yifan Yang, Zhikang Niu, Wenrui Liu, Fan Yu, Zhihao Du, Zhifu Gao, ShiLiang Zhang, Xie Chen
Human speech goes beyond the mere transfer of information; it is a profound
exchange of emotions and a connection between individuals. While Text-to-Speech
(TTS) models have made huge progress, they still face challenges in controlling
the emotional expression in the generated speech. In this work, we propose
EmoVoice, a novel emotion-controllable TTS model that exploits large language
models (LLMs) to enable fine-grained freestyle natural language emotion
control, and a phoneme boost variant design that makes the model output phoneme
tokens and audio tokens in parallel to enhance content consistency, inspired by
chain-of-thought (CoT) and chain-of-modality (CoM) techniques. Besides, we
introduce EmoVoice-DB, a high-quality 40-hour English emotion dataset featuring
expressive speech and fine-grained emotion labels with natural language
descriptions. EmoVoice achieves state-of-the-art performance on the English
EmoVoice-DB test set using only synthetic training data, and on the Chinese
Secap test set using our in-house data. We further investigate the reliability
of existing emotion evaluation metrics and their alignment with human
perceptual preferences, and explore using SOTA multimodal LLMs GPT-4o-audio and
Gemini to assess emotional speech. Demo samples are available at
https://anonymous.4open.science/r/EmoVoice-DF55. Dataset, code, and checkpoints
will be released.
♻ ☆ Spin glass model of in-context learning
Large language models show a surprising in-context learning ability -- being
able to use a prompt to form a prediction for a query, yet without additional
training, in stark contrast to old-fashioned supervised learning. Providing a
mechanistic interpretation and linking the empirical phenomenon to physics are
thus challenging and remain unsolved. We study a simple yet expressive
transformer with linear attention and map this structure to a spin glass model
with real-valued spins, where the couplings and fields explain the intrinsic
disorder in data. The spin glass model explains how the weight parameters
interact with each other during pre-training, and further clarifies why an
unseen function can be predicted by providing only a prompt yet without further
training. Our theory reveals that for single-instance learning, increasing the
task diversity leads to the emergence of in-context learning, by allowing the
Boltzmann distribution to converge to a unique correct solution of weight
parameters. Therefore the pre-trained transformer displays a prediction power
in a novel prompt setting. The proposed analytically tractable model thus
offers a promising avenue for thinking about how to interpret many intriguing
but puzzling properties of large language models.
comment: 16 pages, 4+6 figures, revised version to the journal
♻ ☆ StaICC: Standardized Evaluation for Classification Task in In-context Learning
Classification tasks are widely investigated in the In-Context Learning (ICL)
paradigm. However, current efforts are evaluated on disjoint benchmarks and
settings, while their performances are significantly influenced by some trivial
variables, such as prompt templates, data sampling, instructions, etc., which
leads to significant inconsistencies in the results reported across various
literature, preventing fair comparison or meta-analysis across different
papers. Therefore, this paper proposes a standardized and easy-to-use
evaluation toolkit (StaICC) for in-context classification. For the standard
classification task, we provide StaICC-Normal, which selects 10 widely used
datasets and generates prompts in a fixed form to mitigate variance across
experiment implementations. To enrich the usage of our benchmark, we also
provide a sub-benchmark, StaICC-Diag, for diagnosing ICL from several aspects,
aiming for more robust inference processing.
comment: 20 pages, 8 figures, 8 tables
♻ ☆ Towards Robust Alignment of Language Models: Distributionally Robustifying Direct Preference Optimization
Junkang Wu, Yuexiang Xie, Zhengyi Yang, Jiancan Wu, Jiawei Chen, Jinyang Gao, Bolin Ding, Xiang Wang, Xiangnan He
This study addresses the challenge of noise in training datasets for Direct
Preference Optimization (DPO), a method for aligning Large Language Models
(LLMs) with human preferences. We categorize noise into pointwise noise, which
includes low-quality data points, and pairwise noise, which encompasses
erroneous data pair associations that affect preference rankings. Utilizing
Distributionally Robust Optimization (DRO), we enhance DPO's resilience to
these types of noise. Our theoretical insights reveal that DPO inherently
embeds DRO principles, conferring robustness to pointwise noise, with the
regularization coefficient $\beta$ playing a critical role in its noise
resistance. Extending this framework, we introduce Distributionally
Robustifying DPO (Dr. DPO), which integrates pairwise robustness by optimizing
against worst-case pairwise scenarios. The novel hyperparameter $\beta'$ in Dr.
DPO allows for fine-tuned control over data pair reliability, providing a
strategic balance between exploration and exploitation in noisy training
environments. Empirical evaluations demonstrate that Dr. DPO substantially
improves the quality of generated text and response accuracy in preference
datasets, showcasing enhanced performance in both noisy and noise-free
settings. The code is available at https://github.com/junkangwu/Dr_DPO.
♻ ☆ SciLitLLM: How to Adapt LLMs for Scientific Literature Understanding ICLR 2025
Sihang Li, Jin Huang, Jiaxi Zhuang, Yaorui Shi, Xiaochen Cai, Mingjun Xu, Xiang Wang, Linfeng Zhang, Guolin Ke, Hengxing Cai
Scientific literature understanding is crucial for extracting targeted
information and garnering insights, thereby significantly advancing scientific
discovery. Despite the remarkable success of Large Language Models (LLMs), they
face challenges in scientific literature understanding, primarily due to (1) a
lack of scientific knowledge and (2) unfamiliarity with specialized scientific
tasks.
To develop an LLM specialized in scientific literature understanding, we
propose a hybrid strategy that integrates continual pre-training (CPT) and
supervised fine-tuning (SFT), to simultaneously infuse scientific domain
knowledge and enhance instruction-following capabilities for domain-specific
tasks. In this process, we identify two key challenges: (1) constructing
high-quality CPT corpora, and (2) generating diverse SFT instructions. We
address these challenges through a meticulous pipeline, including PDF text
extraction, parsing content error correction, quality filtering, and synthetic
instruction creation. Applying this strategy, we present a suite of LLMs:
SciLitLLM, specialized in scientific literature understanding. These models
demonstrate promising performance on scientific literature understanding
benchmarks.
Our contributions are threefold: (1) We present an effective framework that
integrates CPT and SFT to adapt LLMs to scientific literature understanding,
which can also be easily adapted to other domains. (2) We propose an LLM-based
synthesis method to generate diverse and high-quality scientific instructions,
resulting in a new instruction set -- SciLitIns -- for supervised fine-tuning
in less-represented scientific domains. (3) SciLitLLM achieves promising
performance improvements on scientific literature understanding benchmarks.
comment: ICLR 2025
♻ ☆ Large Language Models are Good Multi-lingual Learners : When LLMs Meet Cross-lingual Prompts
With the advent of Large Language Models (LLMs), generating rule-based data
for real-world applications has become more accessible. Due to the inherent
ambiguity of natural language and the complexity of rule sets, especially in
long contexts, LLMs often struggle to follow all specified rules, frequently
omitting at least one. To enhance the reasoning and understanding of LLMs on
long and complex contexts, we propose a novel prompting strategy Multi-Lingual
Prompt, namely MLPrompt, which automatically translates the error-prone rule
that an LLM struggles to follow into another language, thus drawing greater
attention to it. Experimental results on public datasets across various tasks
have shown MLPrompt can outperform state-of-the-art prompting methods such as
Chain of Thought, Tree of Thought, and Self-Consistency. Additionally, we
introduce a framework integrating MLPrompt with an auto-checking mechanism for
structured data generation, with a specific case study in text-to-MIP
instances. Further, we extend the proposed framework for text-to-SQL to
demonstrate its generation ability towards structured data synthesis.
♻ ☆ Adversarial Style Augmentation via Large Language Model for Robust Fake News Detection WWW'25
The spread of fake news harms individuals and presents a critical social
challenge that must be addressed. Although numerous algorithmic and insightful
features have been developed to detect fake news, many of these features can be
manipulated with style-conversion attacks, especially with the emergence of
advanced language models, making it more difficult to differentiate from
genuine news. This study proposes adversarial style augmentation, AdStyle,
designed to train a fake news detector that remains robust against various
style-conversion attacks. The primary mechanism involves the strategic use of
LLMs to automatically generate a diverse and coherent array of style-conversion
attack prompts, enhancing the generation of particularly challenging prompts
for the detector. Experiments indicate that our augmentation strategy
significantly improves robustness and detection performance when evaluated on
fake news benchmark datasets.
comment: WWW'25 research track accepted
♻ ☆ Semantic Matters: Multimodal Features for Affective Analysis
In this study, we present our methodology for two tasks: the Emotional
Mimicry Intensity (EMI) Estimation Challenge and the Behavioural
Ambivalence/Hesitancy (BAH) Recognition Challenge, both conducted as part of
the 8th Workshop and Competition on Affective & Behavior Analysis in-the-wild.
We utilize a Wav2Vec 2.0 model pre-trained on a large podcast dataset to
extract various audio features, capturing both linguistic and paralinguistic
information. Our approach incorporates a valence-arousal-dominance (VAD) module
derived from Wav2Vec 2.0, a BERT text encoder, and a vision transformer (ViT)
with predictions subsequently processed through a long short-term memory (LSTM)
architecture or a convolution-like method for temporal modeling. We integrate
the textual and visual modalities into our analysis, recognizing that semantic
content provides valuable contextual cues and underscoring that the meaning of
speech often conveys more critical insights than its acoustic counterpart
alone. Fusing in the visual modality helps in some cases to interpret the
textual modality more precisely. This combined approach results in significant
performance improvements, achieving in EMI $\rho_{\text{TEST}} = 0.706$ and in
BAH $F1_{\text{TEST}} = 0.702$, securing first place in the EMI challenge and
second place in the BAH challenge.
♻ ☆ Large Language Model-Based Knowledge Graph System Construction for Sustainable Development Goals: An AI-Based Speculative Design Perspective
From 2000 to 2015, the UN's Millennium Development Goals guided global
priorities. The subsequent Sustainable Development Goals (SDGs) adopted a more
dynamic approach, with annual indicator updates. As 2030 nears and progress
lags, innovative acceleration strategies are critical. This study develops an
AI-powered knowledge graph system to analyze SDG interconnections, discover
potential new goals, and visualize them online. Using official SDG texts,
Elsevier's keyword dataset, and 1,127 TED Talk transcripts (2020.01-2024.04), a
pilot on 269 talks from 2023 applies AI-speculative design, large language
models, and retrieval-augmented generation. Key findings include: (1) Heatmap
analysis reveals strong associations between Goal 10 and Goal 16, and minimal
coverage of Goal 6. (2) In the knowledge graph, simulated dialogue over time
reveals new central nodes, showing how richer data supports divergent thinking
and goal clarity. (3) Six potential new goals are proposed, centered on equity,
resilience, and technology-driven inclusion. This speculative-AI framework
offers fresh insights for policymakers and lays groundwork for future
multimodal and cross-system SDG applications.
comment: This is a minor revision: fixed a typo in the abstract (time range)
and corrected minor textual errors
♻ ☆ ReaRAG: Knowledge-guided Reasoning Enhances Factuality of Large Reasoning Models with Iterative Retrieval Augmented Generation
Large Reasoning Models (LRMs) exhibit remarkable reasoning abilities but rely
primarily on parametric knowledge, limiting factual accuracy. While recent
works equip reinforcement learning (RL)-based LRMs with retrieval capabilities,
they suffer from overthinking and lack robustness in reasoning, reducing their
effectiveness in question answering (QA) tasks. To address this, we propose
ReaRAG, a factuality-enhanced reasoning model that explores diverse queries
without excessive iterations. Our solution includes a novel data construction
framework with an upper bound on the reasoning chain length. Specifically, we
first leverage an LRM to generate deliberate thinking, then select an action
from a predefined action space (Search and Finish). For the Search action, a
query is executed against the RAG engine, and the result is returned as an
observation to guide later reasoning steps. This process iterates until a
Finish action is
chosen. Benefiting from ReaRAG's strong reasoning capabilities, our approach
outperforms existing baselines on multi-hop QA. Further analysis highlights its
strong reflective ability to recognize errors and refine its reasoning
trajectory. Our study enhances LRMs' factuality while effectively integrating
robust reasoning for Retrieval-Augmented Generation (RAG).
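A minimal sketch of the bounded reason-then-act loop described above, where
think() and retrieve() are placeholders for the LRM call (returning an action
from {Search, Finish} plus its argument) and the RAG engine:

    def rearag_answer(question, think, retrieve, max_steps=8):
        # max_steps mirrors the upper bound placed on the reasoning chain
        # during data construction.
        context = []
        for _ in range(max_steps):
            action, payload = think(question, context)
            if action == "Finish":
                return payload                      # final answer
            observation = retrieve(payload)         # payload is the search query
            context.append((payload, observation))
        return None  # budget exhausted without a Finish action (simplification)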
♻ ☆ DocAgent: A Multi-Agent System for Automated Code Documentation Generation
High-quality code documentation is crucial for software development
especially in the era of AI. However, generating it automatically using Large
Language Models (LLMs) remains challenging, as existing approaches often
produce incomplete, unhelpful, or factually incorrect outputs. We introduce
DocAgent, a novel multi-agent collaborative system using topological code
processing for incremental context building. Specialized agents (Reader,
Searcher, Writer, Verifier, Orchestrator) then collaboratively generate
documentation. We also propose a multi-faceted evaluation framework assessing
Completeness, Helpfulness, and Truthfulness. Comprehensive experiments show
DocAgent significantly outperforms baselines consistently. Our ablation study
confirms the vital role of the topological processing order. DocAgent offers a
robust approach for reliable code documentation generation in complex and
proprietary repositories.
comment: Public Repo: https://github.com/facebookresearch/DocAgent
♻ ☆ Revealing the Intrinsic Ethical Vulnerability of Aligned Large Language Models
Large language models (LLMs) are foundational explorations toward artificial
general intelligence, yet their alignment with human values via instruction
tuning and preference learning achieves only superficial compliance. Here, we
demonstrate that harmful knowledge embedded during pretraining persists as
indelible "dark patterns" in LLMs' parametric memory, evading alignment
safeguards and resurfacing under adversarial inducement at distributional
shifts. In this study, we first theoretically analyze the intrinsic ethical
vulnerability of aligned LLMs by proving that current alignment methods yield
only local "safety regions" in the knowledge manifold. In contrast, pretrained
knowledge remains globally connected to harmful concepts via high-likelihood
adversarial trajectories. Building on this theoretical insight, we empirically
validate our findings by employing semantic coherence inducement under
distributional shifts--a method that systematically bypasses alignment
constraints through optimized adversarial prompts. This combined theoretical
and empirical approach achieves a 100% attack success rate across 19 out of 23
state-of-the-art aligned LLMs, including DeepSeek-R1 and LLaMA-3, revealing
their universal vulnerabilities.
♻ ☆ Assessing Judging Bias in Large Reasoning Models: An Empirical Study
Qian Wang, Zhanzhi Lou, Zhenheng Tang, Nuo Chen, Xuandong Zhao, Wenxuan Zhang, Dawn Song, Bingsheng He
Large Reasoning Models (LRMs) like DeepSeek-R1 and OpenAI-o1 have
demonstrated remarkable reasoning capabilities, raising important questions
about their biases in LLM-as-a-judge settings. We present a comprehensive
benchmark comparing judging biases between LLMs and LRMs across both subjective
preference-alignment datasets and objective fact-based datasets. Through
investigation of bandwagon, authority, position, and distraction biases, we
uncover four key findings: (1) despite their advanced reasoning capabilities,
LRMs remain susceptible to the above biases; (2) LRMs demonstrate better
robustness than LLMs specifically on fact-related datasets; (3) LRMs exhibit
notable position bias, preferring options in later positions; and (4) we
identify a novel "superficial reflection bias" where phrases mimicking
reasoning (e.g., "wait, let me think...") significantly influence model
judgments. To address these biases, we design and evaluate three mitigation
strategies: specialized system prompts that reduce judging biases by up to 19\%
in preference alignment datasets and 14\% in fact-related datasets, in-context
learning that provides up to 27\% improvement on preference tasks but shows
inconsistent results on factual tasks, and a self-reflection mechanism that
reduces biases by up to 10\% in preference datasets and 16\% in fact-related
datasets, with self-reflection proving particularly effective for LRMs. Our
work provides crucial insights for developing more reliable LLM-as-a-Judge
frameworks, especially as LRMs become increasingly deployed as automated
judges.
♻ ☆ Where is the answer? Investigating Positional Bias in Language Model Knowledge Extraction
Large language models require updates to remain up-to-date or adapt to new
domains by fine-tuning them with new documents. One key is memorizing the
latest information in a way that the memorized information is extractable with
a query prompt. However, LLMs suffer from a phenomenon called perplexity curse;
despite minimizing document perplexity during fine-tuning, LLMs struggle to
extract information through a prompt sentence. In this new knowledge
acquisition and extraction setting, we find an intriguing fact: LLMs can
accurately answer questions about the first sentence of a document, but they
struggle to extract information described in the middle or at the end of the
documents used for fine-tuning. Our study suggests that auto-regressive
training causes this issue; each token is predicted conditioned on all previous
tokens, which hinders the model from recalling information from training
documents via question prompts. To conduct an in-depth study, we publish both
synthetic and real
datasets, enabling the evaluation of the QA performance w.r.t. the position of
the corresponding answer in a document. Our investigation shows that even a
large model suffers from the perplexity curse, but regularization such as
denoising auto-regressive loss can enhance the information extraction from
diverse positions. These findings will be (i) a key to improving knowledge
extraction from LLMs and (ii) new elements to discuss the trade-off between RAG
and fine-tuning in adapting LLMs to a new domain.
comment: Code is published at https://github.com/omron-sinicx/WhereIsTheAnswer
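To make the probe concrete (an illustrative sketch, not the released datasets
from the repository above), one can place the same answer-bearing sentence at
different positions within filler text used for fine-tuning and then compare QA
accuracy by position:

    def build_probe_documents(fact, fillers, positions=(0.0, 0.5, 1.0)):
        # Insert the answer-bearing sentence at the start, middle, or end of a
        # document composed of filler sentences.
        docs = {}
        for pos in positions:
            idx = int(pos * len(fillers))
            docs[pos] = " ".join(fillers[:idx] + [fact] + fillers[idx:])
        return docs

    # Fine-tune on each variant, then ask the same question about the fact; the
    # abstract reports that answers stated in the first sentence are recalled
    # far more reliably than those in the middle or at the end.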