Computation and Language 129
☆ LLM Post-Training: A Deep Dive into Reasoning Large Language Models
Komal Kumar, Tajamul Ashraf, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Phillip H. S. Torr, Salman Khan, Fahad Shahbaz Khan
Large Language Models (LLMs) have transformed the natural language processing
landscape and brought to life diverse applications. Pretraining on vast
web-scale data has laid the foundation for these models, yet the research
community is now increasingly shifting focus toward post-training techniques to
achieve further breakthroughs. While pretraining provides a broad linguistic
foundation, post-training methods enable LLMs to refine their knowledge,
improve reasoning, enhance factual accuracy, and align more effectively with
user intents and ethical considerations. Fine-tuning, reinforcement learning,
and test-time scaling have emerged as critical strategies for optimizing LLM
performance, ensuring robustness, and improving adaptability across various
real-world tasks. This survey provides a systematic exploration of
post-training methodologies, analyzing their role in refining LLMs beyond
pretraining, addressing key challenges such as catastrophic forgetting, reward
hacking, and inference-time trade-offs. We highlight emerging directions in
model alignment, scalable adaptation, and inference-time reasoning, and outline
future research directions. We also provide a public repository to continually
track developments in this fast-evolving field:
https://github.com/mbzuai-oryx/Awesome-LLM-Post-training.
comment: 31 pages, 7 figures, 3 tables, 375 references
☆ Identifying Emerging Concepts in Large Corpora
We introduce a new method to identify emerging concepts in large text
corpora. By analyzing changes in the heatmaps of the underlying embedding
space, we are able to detect these concepts with high accuracy shortly after
they originate, in turn outperforming common alternatives. We further
demonstrate the utility of our approach by analyzing speeches in the U.S.
Senate from 1941 to 2015. Our results suggest that the minority party is more
active in introducing new concepts into the Senate discourse. We also identify
specific concepts that closely correlate with the Senators' racial, ethnic, and
gender identities. An implementation of our method is publicly available.
comment: 9 pages, 4 figures
☆ FANformer: Improving Large Language Models Through Effective Periodicity Modeling
Yihong Dong, Ge Li, Xue Jiang, Yongding Tao, Kechi Zhang, Hao Zhu, Huanyu Liu, Jiazheng Ding, Jia Li, Jinliang Deng, Hong Mei
Periodicity, as one of the most important basic characteristics, lays the
foundation for facilitating structured knowledge acquisition and systematic
cognitive processes within human learning paradigms. However, potential flaws
in the Transformer's periodicity modeling affect the learning efficiency of
large language models (LLMs) built upon it and their ability to establish
underlying principles from data. In this paper, we demonstrate that integrating effective
periodicity modeling can improve the learning efficiency and performance of
LLMs. We introduce FANformer, which integrates the Fourier Analysis Network
(FAN) into the attention mechanism by modifying its feature projection
process, achieving efficient periodicity modeling. Extensive
experimental results on language modeling show that FANformer consistently
outperforms Transformer when scaling up model size and training tokens,
underscoring its superior learning efficiency. To further validate the
effectiveness of FANformer, we pretrain a FANformer-1B on 1 trillion tokens.
FANformer-1B exhibits marked improvements on downstream tasks compared to
open-source LLMs with a similar number of parameters or training tokens. The results
position FANformer as an effective and promising architecture for advancing
LLMs.
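As a rough, illustrative sketch of what a FAN-style periodic projection inside attention could look like (the split ratio, choice of sin/cos features, activation, and placement on the query/key projection are assumptions here, not FANformer's published design):

# Illustrative sketch only: adds a Fourier-style (periodic) component to an
# attention feature projection; output dimension matches the input d_model so
# it could stand in for a query/key projection.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FourierProjection(nn.Module):
    def __init__(self, d_model: int, periodic_ratio: float = 0.25):
        super().__init__()
        self.d_p = int(d_model * periodic_ratio)   # dims given periodic features
        self.d_r = d_model - 2 * self.d_p          # remaining ordinary dims
        self.w_p = nn.Linear(d_model, self.d_p, bias=False)
        self.w_r = nn.Linear(d_model, self.d_r)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        p = self.w_p(x)
        # concatenate periodic (cos/sin) features with an ordinary projection
        return torch.cat([torch.cos(p), torch.sin(p), F.gelu(self.w_r(x))], dim=-1)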
☆ Persuasion Should be Double-Blind: A Multi-Domain Dialogue Dataset With Faithfulness Based on Causal Theory of Mind
Persuasive dialogue plays a pivotal role in human communication, influencing
various domains. Recent persuasive dialogue datasets often fail to align with
real-world interpersonal interactions, leading to unfaithful representations.
For instance, unrealistic scenarios may arise, such as when the persuadee
explicitly instructs the persuader on which persuasion strategies to employ,
with each of the persuadee's questions corresponding to a specific strategy for
the persuader to follow. This issue can be attributed to a violation of the
"Double Blind" condition, where critical information is fully shared between
participants. In actual human interactions, however, key information such as
the mental state of the persuadee and the persuasion strategies of the
persuader is not directly accessible. The persuader must infer the persuadee's
mental state using Theory of Mind capabilities and construct arguments that
align with the persuadee's motivations. To address this gap, we introduce
ToMMA, a novel multi-agent framework for dialogue generation that is guided by
causal Theory of Mind. This framework ensures that information remains
undisclosed between agents, preserving "double-blind" conditions, while causal
ToM directs the persuader's reasoning, enhancing alignment with human-like
persuasion dynamics. Consequently, we present CToMPersu, a multi-domain,
multi-turn persuasive dialogue dataset that tackles both double-blind and
logical coherence issues, demonstrating superior performance across multiple
metrics and achieving better alignment with real human dialogues. Our dataset
and prompts are available at https://github.com/DingyiZhang/ToMMA-CToMPersu .
comment: 23 pages
☆ Token-level Ensembling of Models with Different Vocabularies
Model ensembling is a technique to combine the predicted distributions of two
or more models, often leading to improved robustness and performance. For
ensembling in text generation, the next token's probability distribution is
derived from a weighted sum of the distributions of each individual model. This
requires the underlying models to share the same subword vocabulary, limiting
the applicability of ensembling, since many open-sourced models have distinct
vocabularies. In research settings, experimentation or upgrades to vocabularies
may introduce multiple vocabulary sizes. This paper proposes an inference-time
only algorithm that allows for ensembling models with different vocabularies,
without the need to learn additional parameters or alter the underlying models.
Instead, the algorithm ensures that tokens generated by the ensembled models
agree in their surface form. We apply this technique to combinations
of traditional encoder-decoder models and decoder-only LLMs and evaluate on
machine translation. In addition to expanding to model pairs that were
previously incapable of token-level ensembling, our algorithm frequently
improves translation performance over either model individually.
comment: Under review
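The following toy Python sketch illustrates the general idea of ensembling at the level of surface strings rather than shared token IDs; it is not the paper's algorithm, and the two stand-in scorers are hypothetical placeholders for real models that would each score a continuation string under their own tokenization:

# Toy sketch of surface-form ensembling across models with different vocabularies.
import math
from typing import Callable, List

StringScorer = Callable[[str, str], float]  # (prefix, continuation) -> log-prob

def ensemble_pick(prefix: str,
                  candidates: List[str],
                  scorers: List[StringScorer],
                  weights: List[float]) -> str:
    """Pick the continuation string with the highest weighted log-probability."""
    def score(cand: str) -> float:
        return sum(w * s(prefix, cand) for w, s in zip(weights, scorers))
    return max(candidates, key=score)

# Hypothetical stand-ins for real models: each would compute the continuation's
# log-probability internally using its own tokenizer.
def model_a(prefix: str, cont: str) -> float:
    return math.log({"cat": 0.6, "dog": 0.3, "car": 0.1}.get(cont, 1e-9))

def model_b(prefix: str, cont: str) -> float:
    return math.log({"cat": 0.4, "dog": 0.5, "car": 0.1}.get(cont, 1e-9))

print(ensemble_pick("I saw a ", ["cat", "dog", "car"], [model_a, model_b], [0.5, 0.5]))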
☆ RuCCoD: Towards Automated ICD Coding in Russian
Aleksandr Nesterov, Andrey Sakhovskiy, Ivan Sviridov, Airat Valiev, Vladimir Makharev, Petr Anokhin, Galina Zubkova, Elena Tutubalina
This study investigates the feasibility of automating clinical coding in
Russian, a language with limited biomedical resources. We present a new dataset
for ICD coding, which includes diagnosis fields from electronic health records
(EHRs) annotated with over 10,000 entities and more than 1,500 unique ICD
codes. This dataset serves as a benchmark for several state-of-the-art models,
including BERT, LLaMA with LoRA, and RAG, with additional experiments examining
transfer learning across domains (from PubMed abstracts to medical diagnosis)
and terminologies (from UMLS concepts to ICD codes). We then apply the
best-performing model to label an in-house EHR dataset containing patient
histories from 2017 to 2021. Our experiments, conducted on a carefully curated
test set, demonstrate that training with the automatically predicted codes leads to
a significant improvement in accuracy compared to manually annotated data from
physicians. We believe our findings offer valuable insights into the potential
for automating clinical coding in resource-limited languages like Russian,
which could enhance clinical efficiency and data accuracy in these contexts.
☆ Semantic Volume: Quantifying and Detecting both External and Internal Uncertainty in LLMs
Large language models (LLMs) have demonstrated remarkable performance across
diverse tasks by encoding vast amounts of factual knowledge. However, they are
still prone to hallucinations, generating incorrect or misleading information,
often accompanied by high uncertainty. Existing methods for hallucination
detection primarily focus on quantifying internal uncertainty, which arises
from missing or conflicting knowledge within the model. However, hallucinations
can also stem from external uncertainty, where ambiguous user queries lead to
multiple possible interpretations. In this work, we introduce Semantic Volume,
a novel mathematical measure for quantifying both external and internal
uncertainty in LLMs. Our approach perturbs queries and responses, embeds them
in a semantic space, and computes the determinant of the Gram matrix of the
embedding vectors, capturing their dispersion as a measure of uncertainty. Our
framework provides a generalizable and unsupervised uncertainty detection
method without requiring white-box access to LLMs. We conduct extensive
experiments on both external and internal uncertainty detection, demonstrating
that our Semantic Volume method consistently outperforms existing baselines in
both tasks. Additionally, we provide theoretical insights linking our measure
to differential entropy, unifying and extending previous sampling-based
uncertainty measures such as semantic entropy. Semantic Volume is shown to
be a robust and interpretable approach to improving the reliability of LLMs by
systematically detecting uncertainty in both user queries and model responses.
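A minimal sketch of the Gram-determinant computation described above; the centering, regularization, and use of the log-determinant are assumptions made for numerical stability rather than the paper's exact normalization:

# Dispersion of embedded perturbations as an uncertainty score.
import numpy as np

def semantic_volume(embeddings: np.ndarray, eps: float = 1e-6) -> float:
    """embeddings: (n, d) array of embedded perturbed queries/responses.
    Returns the log-determinant of their Gram matrix (higher = more dispersed)."""
    X = embeddings - embeddings.mean(axis=0, keepdims=True)  # center
    gram = X @ X.T                                           # (n, n) Gram matrix
    gram += eps * np.eye(gram.shape[0])                      # regularize rank deficiency
    _, logdet = np.linalg.slogdet(gram)
    return float(logdet)

rng = np.random.default_rng(0)
tight = rng.normal(size=(8, 32)) * 0.1    # low dispersion -> lower volume
spread = rng.normal(size=(8, 32)) * 1.0   # high dispersion -> higher volume
print(semantic_volume(tight), semantic_volume(spread))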
☆ Transforming Tuberculosis Care: Optimizing Large Language Models For Enhanced Clinician-Patient Communication AAAI-25
Daniil Filienko, Mahek Nizar, Javier Roberti, Denise Galdamez, Haroon Jakher, Sarah Iribarren, Weichao Yuwen, Martine De Cock
Tuberculosis (TB) is the leading cause of death from an infectious disease
globally, with the highest burden in low- and middle-income countries. In these
regions, limited healthcare access and high patient-to-provider ratios impede
effective patient support, communication, and treatment completion. To bridge
this gap, we propose integrating a specialized Large Language Model into an
efficacious digital adherence technology to augment interactive communication
with treatment supporters. This AI-powered approach, operating within a
human-in-the-loop framework, aims to enhance patient engagement and improve TB
treatment outcomes.
comment: GenAI4Health at AAAI-25
☆ ECLeKTic: a Novel Challenge Set for Evaluation of Cross-Lingual Knowledge Transfer
Omer Goldman, Uri Shaham, Dan Malkin, Sivan Eiger, Avinatan Hassidim, Yossi Matias, Joshua Maynez, Adi Mayrav Gilady, Jason Riesa, Shruti Rijhwani, Laura Rimell, Idan Szpektor, Reut Tsarfaty, Matan Eyal
To achieve equitable performance across languages, multilingual large
language models (LLMs) must be able to abstract knowledge beyond the language
in which it was acquired. However, the current literature lacks reliable ways
to measure LLMs' capability of cross-lingual knowledge transfer. To that end,
we present ECLeKTic, a multilingual closed-book QA (CBQA) dataset that
Evaluates Cross-Lingual Knowledge Transfer in a simple, black-box manner. We
detected information with uneven coverage across languages by controlling for
presence and absence of Wikipedia articles in 12 languages. We generated
knowledge-seeking questions in a source language, for which the answer appears
in a relevant Wikipedia article and translated them to all other 11 languages,
for which the respective Wikipedias lack equivalent articles. Assuming that
Wikipedia reflects the prominent knowledge in the LLM's training data, to solve
ECLeKTic's CBQA task the model is required to transfer knowledge between
languages. Experimenting with 8 LLMs, we show that SOTA models struggle to
effectively share knowledge across languages, even if they can predict the
answer well for queries in the same language the knowledge was acquired in.
☆ Detecting Linguistic Diversity on Social Media
This chapter explores the efficacy of using social media data to examine
changing linguistic behaviour of a place. We focus our investigation on
Aotearoa New Zealand, where official statistics from the census are the only
source of language use data. We use published census data as the ground truth
and the social media sub-corpus from the Corpus of Global Language Use as our
alternative data source. We use place as the common denominator between the two
data sources. We identify the language conditions of each tweet in the social
media dataset and validate our results with two language identification
models. We then compare levels of linguistic diversity at national, regional,
and local geographies. The results suggest that social media language data can
provide a rich source of spatial and temporal insights into
the linguistic profile of a place. We show that social media is sensitive to
demographic and sociopolitical changes within a language and at low-level
regional and local geographies.
comment: Accepted to Cartography and GIScience in Australasia and Oceania:
Including twenty years of GeoCart
☆ Optimizing Large Language Models for ESG Activity Detection in Financial Texts
The integration of Environmental, Social, and Governance (ESG) factors into
corporate decision-making is a fundamental aspect of sustainable finance.
However, ensuring that business practices align with evolving regulatory
frameworks remains a persistent challenge. AI-driven solutions for
automatically assessing the alignment of sustainability reports and
non-financial disclosures with specific ESG activities could greatly support
this process. Yet, this task remains complex due to the limitations of
general-purpose Large Language Models (LLMs) in domain-specific contexts and
the scarcity of structured, high-quality datasets. In this paper, we
investigate the ability of current-generation LLMs to identify text related to
environmental activities. Furthermore, we demonstrate that their performance
can be significantly enhanced through fine-tuning on a combination of original
and synthetically generated data. To this end, we introduce ESG-Activities, a
benchmark dataset containing 1,325 labelled text segments classified according
to the EU ESG taxonomy. Our experimental results show that fine-tuning on
ESG-Activities significantly enhances classification accuracy, with open models
such as Llama 7B and Gemma 7B outperforming large proprietary solutions in
specific configurations. These findings have important implications for
financial analysts, policymakers, and AI researchers seeking to enhance ESG
transparency and compliance through advanced natural language processing
techniques.
☆ Generating patient cohorts from electronic health records using two-step retrieval-augmented text-to-SQL generation
Clinical cohort definition is crucial for patient recruitment and
observational studies, yet translating inclusion/exclusion criteria into SQL
queries remains challenging and manual. We present an automated system
utilizing large language models that combines criteria parsing, two-level
retrieval augmented generation with specialized knowledge bases, medical
concept standardization, and SQL generation to retrieve patient cohorts with
patient funnels. The system achieves an F1-score of 0.75 in cohort identification on
EHR data, effectively capturing complex temporal and logical relationships.
These results demonstrate the feasibility of automated cohort generation for
epidemiological research.
comment: 7 pages, 1 figure
☆ Re-evaluating Theory of Mind evaluation in large language models
The question of whether large language models (LLMs) possess Theory of Mind
(ToM) -- often defined as the ability to reason about others' mental states --
has sparked significant scientific and public interest. However, the evidence
as to whether LLMs possess ToM is mixed, and the recent growth in evaluations
has not resulted in a convergence. Here, we take inspiration from cognitive
science to re-evaluate the state of ToM evaluation in LLMs. We argue that a
major reason for the disagreement on whether LLMs have ToM is a lack of clarity
on whether models should be expected to match human behaviors, or the
computations underlying those behaviors. We also highlight ways in which
current evaluations may be deviating from "pure" measurements of ToM abilities,
which further contributes to the confusion. We conclude by discussing several
directions for future research, including the relationship between ToM and
pragmatic communication, which could advance our understanding of artificial
systems as well as human cognition.
comment: under review
☆ PASemiQA: Plan-Assisted Agent for Question Answering on Semi-Structured Data with Text and Relational Information
Large language models (LLMs) have shown impressive abilities in answering
questions across various domains, but they often encounter hallucination issues
on questions that require professional and up-to-date knowledge. To address
this limitation, retrieval-augmented generation (RAG) techniques have been
proposed, which retrieve relevant information from external sources to inform
their responses. However, existing RAG methods typically focus on a single type
of external data, such as vectorized text databases or knowledge graphs, and
cannot well handle real-world questions on semi-structured data containing both
text and relational information. To bridge this gap, we introduce PASemiQA, a
novel approach that jointly leverages text and relational information in
semi-structured data to answer questions. PASemiQA first generates a plan to
identify relevant text and relational information to answer the question in
semi-structured data, and then uses an LLM agent to traverse the
semi-structured data and extract necessary information. Our empirical results
demonstrate the effectiveness of PASemiQA across different semi-structured
datasets from various domains, showcasing its potential to improve the accuracy
and reliability of question answering systems on semi-structured data.
☆ CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation
Chain-of-Thought (CoT) enhances Large Language Models (LLMs) by enabling
step-by-step reasoning in natural language. However, the language space may be
suboptimal for reasoning. While implicit CoT methods attempt to enable
reasoning without explicit CoT tokens, they have consistently lagged behind
explicit CoT methods in task performance. We propose CODI (Continuous
Chain-of-Thought via Self-Distillation), a novel framework that distills CoT
into a continuous space, where a shared model acts as both teacher and student,
jointly learning explicit and implicit CoT while aligning their hidden
activation on the token generating the final answer. CODI is the first implicit
CoT method to match explicit CoT's performance on GSM8k while achieving 3.1x
compression, surpassing the previous state-of-the-art by 28.2% in accuracy.
Furthermore, CODI demonstrates scalability, robustness, and generalizability to
more complex CoT datasets. Additionally, CODI retains interpretability by
decoding its continuous thoughts, making its reasoning process transparent. Our
findings establish implicit CoT as not only a more efficient but also a powerful
alternative to explicit CoT.
comment: 15 pages
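A rough sketch of a CODI-style objective as the abstract describes it (explicit-CoT and implicit-CoT language-modeling losses plus a hidden-state alignment term on the answer token); the loss weights, the distance function, and detaching the teacher side are assumptions, not the paper's exact formulation:

# Combined loss for a shared model acting as both teacher (explicit CoT) and
# student (implicit/continuous CoT), aligned on the answer-generating token.
import torch
import torch.nn.functional as F

def codi_style_loss(teacher_logits, teacher_labels,
                    student_logits, student_labels,
                    teacher_hidden_at_answer, student_hidden_at_answer,
                    align_weight: float = 1.0) -> torch.Tensor:
    ce_teacher = F.cross_entropy(teacher_logits, teacher_labels)  # explicit CoT LM loss
    ce_student = F.cross_entropy(student_logits, student_labels)  # implicit CoT LM loss
    # align hidden activations on the token that generates the final answer;
    # the teacher activation is detached so it serves as the distillation target
    align = F.l1_loss(student_hidden_at_answer, teacher_hidden_at_answer.detach())
    return ce_teacher + ce_student + align_weight * align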
☆ Beyond Words: A Latent Memory Approach to Internal Reasoning in LLMs
Recent advances in large language models (LLMs) have popularized the
chain-of-thought (CoT) paradigm, in which models produce explicit reasoning
steps in natural language. Although this approach improves interpretability and
facilitates external auditing, it may not represent the most computationally
efficient method for internal reasoning. In contrast, human cognition relies on
implicit mental representations that recall past sensory and episodic
information without requiring complete verbalization. In this paper, we propose
a framework that integrates implicit mental representations into the internal
reasoning processes of LLMs. Preliminary experiments indicate that
incorporating an Implicit Memory Module (IMM) into a simple GPT model yields a
reduction of between 35% and 57% in final training loss compared to a regular
GPT baseline. The addition of an explicit interpretability channel (e.g., a
chain-of-thought decoder) is straightforward to implement within this approach.
We outline theoretical foundations, propose technical mechanisms to scale the
memory module, and discuss how these ideas may lead to more efficient and
robust reasoning, with optional future extensions for explicit auditability.
comment: 13 pages, 5 figures
☆ Extending Dense Passage Retrieval with Temporal Information
Temporal awareness is crucial in many information retrieval tasks,
particularly in scenarios where the relevance of documents depends on their
alignment with the query's temporal context. Traditional retrieval methods such
as BM25 and Dense Passage Retrieval (DPR) excel at capturing lexical and
semantic relevance but fall short in addressing time-sensitive queries. To
bridge this gap, we introduce a temporal retrieval model that integrates
explicit temporal signals by incorporating query timestamps and document dates
into the representation space. Our approach ensures that retrieved passages are
not only topically relevant but also temporally aligned with user intent. We
evaluate our approach on two large-scale benchmark datasets, ArchivalQA and
ChroniclingAmericaQA, achieving substantial performance gains over standard
retrieval baselines. In particular, our model improves Top-1 retrieval accuracy
by 6.63% and NDCG@10 by 3.79% on ArchivalQA, while yielding a 9.56% boost in
Top-1 retrieval accuracy and 4.68% in NDCG@10 on ChroniclingAmericaQA.
Additionally, we introduce a time-sensitive negative sampling strategy, which
refines the model's ability to distinguish between temporally relevant and
irrelevant documents during training. Our findings highlight the importance of
explicitly modeling time in retrieval systems and set a new standard for
handling temporally grounded queries.
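One simple way to inject temporal signals into a dual-encoder retriever, shown purely as an illustration (the date-encoding scheme and temporal weighting below are assumptions, not the paper's model):

# Concatenate a date encoding to text embeddings and score by dot product, as in DPR.
import numpy as np

def date_features(year: int, month: int) -> np.ndarray:
    # cyclical month encoding plus a scaled year term
    angle = 2 * np.pi * (month - 1) / 12.0
    return np.array([np.sin(angle), np.cos(angle), (year - 1900) / 200.0])

def temporal_embed(text_emb: np.ndarray, year: int, month: int,
                   time_weight: float = 1.0) -> np.ndarray:
    return np.concatenate([text_emb, time_weight * date_features(year, month)])

def score(query_emb: np.ndarray, passage_emb: np.ndarray) -> float:
    return float(query_emb @ passage_emb)   # dot-product relevance

q = temporal_embed(np.ones(4), 1944, 6)           # query about June 1944
p_close = temporal_embed(np.ones(4), 1944, 7)     # temporally close passage
p_far = temporal_embed(np.ones(4), 2001, 1)       # temporally distant passage
print(score(q, p_close) > score(q, p_far))        # True: closer date scores higher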
☆ PersuasiveToM: A Benchmark for Evaluating Machine Theory of Mind in Persuasive Dialogues
The ability to understand and predict the mental states of oneself and
others, known as the Theory of Mind (ToM), is crucial for effective social
interactions. Recent research has emerged to evaluate whether Large Language
Models (LLMs) exhibit a form of ToM. Although recent studies have evaluated ToM
in LLMs, existing benchmarks focus predominantly on physical perception with
principles guided by the Sally-Anne test in synthetic stories and
conversations, failing to capture the complex psychological activities of
mental states in real-life social interactions. To mitigate this gap, we
propose PersuasiveToM, a benchmark designed to evaluate the ToM abilities of
LLMs in persuasive dialogues. Our framework introduces two categories of
questions: (1) ToM Reasoning, assessing the capacity of LLMs to track evolving
mental states (e.g., desire shifts in persuadees), and (2) ToM Application,
evaluating whether LLMs can take advantage of inferred mental states to select
effective persuasion strategies (e.g., emphasize rarity) and evaluate the
effectiveness of persuasion strategies. Experiments across eight
state-of-the-art LLMs reveal that while models excel on multiple questions,
they struggle to answer questions that require tracking the dynamics and shifts
of mental states and comprehensively understanding mental states across the
whole dialogue. Our aim with PersuasiveToM is to allow an effective evaluation
of the ToM reasoning ability of LLMs with more focus on complex psychological
activities. Our code is available at
https://github.com/Yu-Fangxu/PersuasiveToM.
☆ Capability Localization: Capabilities Can be Localized rather than Individual Knowledge
Large-scale language models have achieved superior performance in natural
language processing tasks; however, it is still unclear how model parameters
contribute to performance improvements. Previous studies assumed that
individual pieces of knowledge are stored in local parameters, with the
proposed storage forms ranging from dispersed parameters to parameter layers
or parameter chains, but these accounts are not unified. Through fidelity and
reliability evaluation experiments, we found that individual knowledge cannot
be localized.
Afterwards, we constructed a dataset for decoupling experiments and discovered
the potential for localizing data commonalities. To further reveal this
phenomenon, this paper proposes a Commonality Neuron Localization (CNL) method,
which successfully locates commonality neurons and achieves a neuron overlap
rate of 96.42% on the GSM8K dataset. Finally, we demonstrate through
cross-data experiments that commonality neurons are a collection of capability
neurons that enhance performance. Our code is
available at https://github.com/nlpkeg/Capability-Neuron-Localization.
☆ Merging Clinical Knowledge into Large Language Models for Medical Research and Applications: A Survey
Clinical knowledge is the collection of information learned from studies on
the causes, prognosis, diagnosis, and treatment of diseases. This type of
knowledge can improve treatment outcomes and promote physical health. With
the emergence of large language models (LLMs), medical artificial intelligence
(medical AI), which aims to apply academic medical AI systems to real-world
medical scenarios, has entered a new age of development, resulting in excellent
works such as DoctorGPT and Pangu-Drug from academic and industrial research.
However, the field lacks a comprehensive compendium and comparison of how
medical AI systems are built in academia and industry. Therefore, this survey focuses
on the building paradigms of medical AI systems including the use of clinical
databases, datasets, training pipelines, integrating medical knowledge graphs,
system applications, and evaluation systems. We hope that this survey can help
relevant practical researchers understand the current performance of academic
models in various fields of healthcare, as well as the potential problems and
future directions for implementing these scientific achievements.
☆ UoR-NCL at SemEval-2025 Task 1: Using Generative LLMs and CLIP Models for Multilingual Multimodal Idiomaticity Representation
SemEval-2025 Task 1 focuses on ranking images based on their alignment with a
given nominal compound that may carry idiomatic meaning in both English and
Brazilian Portuguese. To address this challenge, this work uses generative
large language models (LLMs) and multilingual CLIP models to enhance idiomatic
compound representations. LLMs generate idiomatic meanings for potentially
idiomatic compounds, enriching their semantic interpretation. These meanings
are then encoded using multilingual CLIP models, serving as representations for
image ranking. Contrastive learning and data augmentation techniques are
applied to fine-tune these embeddings for improved performance. Experimental
results show that multimodal representations extracted through this method
outperformed those based solely on the original nominal compounds. The
fine-tuning approach shows promising outcomes but is less effective than using
embeddings without fine-tuning. The source code used in this paper is available
at https://github.com/tongwu17/SemEval-2025-Task1-UoR-NCL.
☆ Set-Theoretic Compositionality of Sentence Embeddings
Sentence encoders play a pivotal role in various NLP tasks; hence, an
accurate evaluation of their compositional properties is paramount. However,
existing evaluation methods predominantly focus on goal task-specific
performance. This leaves a significant gap in understanding how well sentence
embeddings demonstrate fundamental compositional properties in a
task-independent context. Leveraging classical set theory, we address this gap
by proposing six criteria based on three core "set-like"
compositions/operations: TextOverlap, TextDifference, and
TextUnion. We systematically evaluate 7 classical and 9 Large
Language Model (LLM)-based sentence encoders to assess their alignment with
these criteria. Our findings show that SBERT consistently demonstrates set-like
compositional properties, surpassing even the latest LLMs. Additionally, we
introduce a new dataset of ~192K samples designed to facilitate future
benchmarking efforts on set-like compositionality of sentence embeddings.
☆ Arabizi vs LLMs: Can the Genie Understand the Language of Aladdin?
In this era of rapid technological advancements, communication continues to
evolve as new linguistic phenomena emerge. Among these is Arabizi, a hybrid
form of Arabic that incorporates Latin characters and numbers to represent the
spoken dialects of Arab communities. Arabizi is widely used on social media and
allows people to communicate in an informal and dynamic way, but it poses
significant challenges for machine translation due to its lack of formal
structure and deeply embedded cultural nuances. This case study arises from a
growing need to translate Arabizi for gisting purposes. It evaluates the
capacity of different LLMs to decode and translate Arabizi, focusing on
multiple Arabic dialects that have rarely been studied up until now. Using a
combination of human evaluators and automatic metrics, this research project
investigates the models' performance in translating Arabizi into both Modern
Standard Arabic and English. Key questions explored include which dialects are
translated most effectively and whether translations into English surpass those
into Arabic.
comment: Submitted to MT Summit 2025
☆ Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs
Weixiang Zhao, Yulin Hu, Yang Deng, Jiahe Guo, Xingyu Sui, Xinyang Han, An Zhang, Yanyan Zhao, Bing Qin, Tat-Seng Chua, Ting Liu
Role-playing enables large language models (LLMs) to engage users in
immersive and personalized interactions, but it also introduces significant
safety risks. Existing role-play fine-tuning techniques improve role
adaptability but may degrade safety performance, particularly for villainous
characters. In this work, we conduct the first comprehensive assessment of
role-play fine-tuning risks by training 95 role-specific LLMs using RoleBench.
Our experiments reveal that role-play fine-tuning leads to a noticeable decline
in safety performance, with safety risks varying based on character traits. To
tackle this challenge, we propose Safety-Aware Role-Play Fine-Tuning (SaRFT), a
novel method designed to balance role-playing capabilities and safety.
Extensive experiments on LLaMA-3-8B-Instruct, Gemma-2-9B-it, and
Qwen2.5-7B-Instruct demonstrate that SaRFT consistently outperforms
state-of-the-art baselines under both LoRA and full-parameter fine-tuning
settings. Our findings highlight the necessity of role-adaptive safety measures
and provide insights into mitigating role-specific safety risks in role-playing
LLMs.
comment: 25 pages, 10 figures, 13 tables
☆ WebFAQ: A Multilingual Collection of Natural Q&A Datasets for Dense Retrieval
We present WebFAQ, a large-scale collection of open-domain question answering
datasets derived from FAQ-style schema.org annotations. In total, the data
collection consists of 96 million natural question-answer (QA) pairs across 75
languages, including 47 million (49%) non-English samples. WebFAQ further
serves as the foundation for 20 monolingual retrieval benchmarks with a total
size of 11.2 million QA pairs (5.9 million non-English). These datasets are
carefully curated through refined filtering and near-duplicate detection,
yielding high-quality resources for training and evaluating multilingual dense
retrieval models. To empirically confirm WebFAQ's efficacy, we use the
collected QAs to fine-tune an in-domain pretrained XLM-RoBERTa model. Through
this process of dataset-specific fine-tuning, the model achieves significant
retrieval performance gains, which generalize beyond WebFAQ to other
multilingual retrieval benchmarks evaluated in a zero-shot setting. Last but not
least, we utilize WebFAQ to construct a set of QA-aligned bilingual corpora
spanning over 1000 language pairs using state-of-the-art bitext mining and
automated LLM-assessed translation evaluation. Due to our advanced, automated
method of bitext dataset generation, the resulting bilingual corpora
demonstrate higher translation quality compared to similar datasets. WebFAQ and
all associated resources are publicly available on GitHub and HuggingFace.
comment: 10 pages, 3 figures, 7 tables
☆ Automated Evaluation of Meter and Rhyme in Russian Generative and Human-Authored Poetry
Generative poetry systems require effective tools for data engineering and
automatic evaluation, particularly to assess how well a poem adheres to
versification rules, such as the correct alternation of stressed and unstressed
syllables and the presence of rhymes.
In this work, we introduce the Russian Poetry Scansion Tool library designed
for stress mark placement in Russian-language syllabo-tonic poetry, rhyme
detection, and identification of poetic defects. Additionally, we
release RIFMA -- a dataset of poem fragments spanning various genres and forms,
annotated with stress marks. This dataset can be used to evaluate the
capability of modern large language models to accurately place stress marks in
poetic texts.
The published resources provide valuable tools for researchers and
practitioners in the field of creative generative AI, facilitating advancements
in the development and evaluation of generative poetry systems.
comment: 7 pages, 1 figure
☆ Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable?
As AI systems are used in high-stakes applications, ensuring interpretability
is crucial. Mechanistic Interpretability (MI) aims to reverse-engineer neural
networks by extracting human-understandable algorithms to explain their
behavior. This work examines a key question: for a given behavior, and under
MI's criteria, does a unique explanation exist? Drawing on identifiability in
statistics, where parameters are uniquely inferred under specific assumptions,
we explore the identifiability of MI explanations.
We identify two main MI strategies: (1) "where-then-what," which isolates a
circuit replicating model behavior before interpreting it, and (2)
"what-then-where," which starts with candidate algorithms and searches for
neural activation subspaces implementing them, using causal alignment.
We test both strategies on Boolean functions and small multi-layer
perceptrons, fully enumerating candidate explanations. Our experiments reveal
systematic non-identifiability: multiple circuits can replicate behavior, a
circuit can have multiple interpretations, several algorithms can align with
the network, and one algorithm can align with different subspaces.
Is uniqueness necessary? A pragmatic approach may require only standards of
predictiveness and manipulability. If uniqueness is essential for understanding,
stricter criteria may be needed. We also reference the inner interpretability
framework, which validates explanations through multiple criteria. This work
contributes to defining explanation standards in AI.
☆ A database to support the evaluation of gender biases in GPT-4o output ISCA
The widespread application of Large Language Models (LLMs) involves ethical
risks for users and societies. A prominent ethical risk of LLMs is the
generation of unfair language output that reinforces or exacerbates harm for
members of disadvantaged social groups through gender biases (Weidinger et al.,
2022; Bender et al., 2021; Kotek et al., 2023). Hence, the evaluation of the
fairness of LLM outputs with respect to such biases is a topic of rising
interest. To advance research in this field, promote discourse on suitable
normative bases and evaluation methodologies, and enhance the reproducibility
of related studies, we propose a novel approach to database construction. This
approach enables the assessment of gender-related biases in LLM-generated
language beyond merely evaluating their degree of neutralization.
comment: ISCA/ITG Workshop on Diversity in Large Speech and Language Models
☆ Beyond Demographics: Fine-tuning Large Language Models to Predict Individuals' Subjective Text Perceptions
People naturally vary in their annotations for subjective questions and some
of this variation is thought to be due to the person's sociodemographic
characteristics. LLMs have also been used to label data, but recent work has
shown that models perform poorly when prompted with sociodemographic
attributes, suggesting limited inherent sociodemographic knowledge. Here, we
ask whether LLMs can be trained to be accurate sociodemographic models of
annotator variation. Using a curated dataset of five tasks with standardized
sociodemographics, we show that models do improve in sociodemographic prompting
when trained but that this performance gain is largely due to models learning
annotator-specific behaviour rather than sociodemographic patterns. Across all
tasks, our results suggest that models learn little meaningful connection
between sociodemographics and annotation, raising doubts about the current use
of LLMs for simulating sociodemographic variation and behaviour.
comment: Reviewed ARR December 2024
☆ ProBench: Benchmarking Large Language Models in Competitive Programming
With reasoning language models such as OpenAI-o3 and DeepSeek-R1 emerging,
large language models (LLMs) have entered a new phase of development. However,
existing benchmarks for coding evaluation are gradually inadequate to assess
the capability of advanced LLMs in code reasoning. To bridge the gap for
high-level code reasoning assessment, we propose ProBench to benchmark LLMs in
competitive programming, drawing inspiration from the International Collegiate
Programming Contest. ProBench collects a comprehensive set of competitive
programming problems from Codeforces, Luogu, and Nowcoder platforms during the
period from July to December 2024, obtaining real test results through online
submissions to ensure the fairness and accuracy of the evaluation. We establish
a unified problem attribute system, including difficulty grading and algorithm
tagging. With carefully collected and annotated data in ProBench, we
systematically assess 9 latest LLMs in competitive programming across multiple
dimensions, including thought chain analysis, error type diagnosis, and
reasoning depth evaluation. Experimental results show that QwQ-32B-Preview
achieves the best score of 20.93, followed by DeepSeek-V3 with a score of 16.38,
suggesting that models trained with specialized reasoning tasks significantly
outperform general-purpose models (even those larger than the reasoning-oriented models)
in programming. Further analysis also reveals key areas for programming
capability enhancement, e.g., algorithm adaptability and reasoning sufficiency,
providing important insights for the future development of reasoning models.
☆ Better Benchmarking LLMs for Zero-Shot Dependency Parsing
While LLMs excel in zero-shot tasks, their performance in linguistic
challenges like syntactic parsing has been less scrutinized. This paper studies
state-of-the-art open-weight LLMs on the task by comparing them to baselines
that do not have access to the input sentence, including baselines that have
not been used in this context such as random projective trees or optimal linear
arrangements. The results show that most of the tested LLMs cannot outperform
the best uninformed baselines, with only the newest and largest versions of
LLaMA doing so for most languages, and still achieving rather low performance.
Thus, accurate zero-shot syntactic parsing is not forthcoming with open LLMs.
comment: Accepted at NoDaLiDa/Baltic-HLT 2025
☆ Do Language Models Understand Honorific Systems in Javanese?
Mohammad Rifqi Farhansyah, Iwan Darmawan, Adryan Kusumawardhana, Genta Indra Winata, Alham Fikri Aji, Derry Tanti Wijaya
The Javanese language features a complex system of honorifics that vary
according to the social status of the speaker, listener, and referent. Despite
its cultural and linguistic significance, there has been limited progress in
developing a comprehensive corpus to capture these variations for natural
language processing (NLP) tasks. In this paper, we present Unggah-Ungguh, a
carefully curated dataset designed to encapsulate the nuances of Unggah-Ungguh
Basa, the Javanese speech etiquette framework that dictates the choice of words
and phrases based on social hierarchy and context. Using Unggah-Ungguh, we
assess the ability of language models (LMs) to process various levels of
Javanese honorifics through classification and machine translation tasks. To
further evaluate cross-lingual LMs, we conduct machine translation experiments
between Javanese (at specific honorific levels) and Indonesian. Additionally,
we explore whether LMs can generate contextually appropriate Javanese
honorifics in conversation tasks, where the honorific usage should align with
the social role and contextual cues. Our findings indicate that current LMs
struggle with most honorific levels, exhibiting a bias toward certain honorific
tiers.
☆ The Power of Personality: A Human Simulation Perspective to Investigate Large Language Model Agents
Large language models (LLMs) excel in both closed tasks (including
problem-solving and code generation) and open tasks (including creative
writing), yet existing explanations for their capabilities lack connections to
real-world human intelligence. To fill this gap, this paper systematically
investigates LLM intelligence through the lens of "human simulation",
addressing three core questions: (1) How do personality traits affect
problem-solving in closed tasks? (2) How do traits shape creativity in open
tasks? (3) How does single-agent performance influence multi-agent
collaboration? By assigning Big Five personality traits to LLM agents and
evaluating their performance in single- and multi-agent settings, we reveal
that specific traits significantly influence reasoning accuracy (closed tasks)
and creative output (open tasks). Furthermore, multi-agent systems exhibit
collective intelligence distinct from individual capabilities, driven by
distinguishing combinations of personalities. We demonstrate that LLMs
inherently simulate human behavior through next-token prediction, mirroring
human language, decision-making, and collaborative dynamics.
☆ MAMUT: A Novel Framework for Modifying Mathematical Formulas for the Generation of Specialized Datasets for Language Model Training
Mathematical formulas are a fundamental and widely used component in various
scientific fields, serving as a universal language for expressing complex
concepts and relationships. While state-of-the-art transformer models excel in
processing and understanding natural language, they encounter challenges with
mathematical notation, which involves a complex structure and diverse
representations. This study focuses on the development of specialized training
datasets to enhance the encoding of mathematical content. We introduce Math
Mutator (MAMUT), a framework capable of generating equivalent and falsified
versions of a given mathematical formula in LaTeX notation, effectively
capturing the mathematical variety in notation of the same concept. Based on
MAMUT, we have generated four large mathematical datasets containing diverse
notation, which can be used to train language models with enhanced mathematical
embeddings.
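As a toy illustration of the distinction between equivalent and falsified variants (MAMUT's actual rule set is far richer; the two string-level rewrites below only work for the hand-picked example formula):

# Equivalent variant: reorder summands (commutativity); falsified variant: flip an operator.
import random

def equivalent_variant(latex: str) -> str:
    # commutativity of addition on the right-hand side: "a + b + c" -> shuffled order
    lhs, rhs = latex.split("=")
    terms = [t.strip() for t in rhs.split("+")]
    random.shuffle(terms)
    return f"{lhs.strip()} = {' + '.join(terms)}"

def falsified_variant(latex: str) -> str:
    # swap the first "+" for "-" so the identity no longer holds
    return latex.replace("+", "-", 1)

formula = r"(a+b)^2 = a^2 + 2ab + b^2"
print(equivalent_variant(formula))   # e.g. (a+b)^2 = 2ab + b^2 + a^2  (still true)
print(falsified_variant(formula))    # (a-b)^2 = a^2 + 2ab + b^2       (now false)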
☆ A Pilot Empirical Study on When and How to Use Knowledge Graphs as Retrieval Augmented Generation
Xujie Yuan, Yongxu Liu, Shimin Di, Shiwen Wu, Libin Zheng, Rui Meng, Xiaofang Zhou, Lei Chen, Jian Yin
The integration of Knowledge Graphs (KGs) into the Retrieval Augmented
Generation (RAG) framework has attracted significant interest, with early
studies showing promise in mitigating hallucinations and improving model
accuracy. However, a systematic understanding and comparative analysis of the
rapidly emerging KG-RAG methods are still lacking. This paper seeks to lay the
foundation for systematically answering the question of when and how to use
KG-RAG by analyzing their performance in various application scenarios
associated with different technical configurations. After outlining the mind
map of the KG-RAG framework and summarizing its popular pipeline, we conduct a
pilot empirical study of KG-RAG works to reimplement and evaluate 6 KG-RAG
methods across 7 datasets in diverse scenarios, analyzing the impact of 9
KG-RAG configurations in combination with 17 LLMs. Our results underscore the
critical role of appropriate application conditions and optimal configurations
of KG-RAG components.
comment: 8 pages, 2 figures, 14 tables
☆ Learning to Substitute Components for Compositional Generalization
Despite the rising prevalence of neural language models, recent empirical
evidence suggests their deficiency in compositional generalization. One of the
current de-facto solutions to this problem is compositional data augmentation,
which aims to introduce additional compositional inductive bias. However,
existing handcrafted augmentation strategies offer limited improvement when
systematic generalization of neural language models requires multi-grained
compositional bias (i.e., not limited to either lexical or structural biases
alone) or when training sentences have an imbalanced difficulty distribution.
To address these challenges, we first propose a novel compositional
augmentation strategy called Component Substitution (CompSub), which enables
multi-grained composition of substantial substructures across the entire
training set. Furthermore, we introduce the Learning Component Substitution
(LCS) framework. This framework empowers the learning of component substitution
probabilities in CompSub in an end-to-end manner by maximizing the loss of
neural language models, thereby prioritizing challenging compositions with
elusive concepts and novel contexts. We extend the key ideas of CompSub and LCS
to the recently emerging in-context learning scenarios of pre-trained large
language models (LLMs), proposing the LCS-ICL algorithm to enhance the few-shot
compositional generalization of state-of-the-art (SOTA) LLMs. Theoretically, we
provide insights into why applying our algorithms to language models can
improve compositional generalization performance. Empirically, our results on
four standard compositional generalization benchmarks (SCAN, COGS, GeoQuery, and
COGS-QL) demonstrate the superiority of CompSub, LCS, and LCS-ICL, with
improvements of up to 66.5%, 10.3%, 1.4%, and 8.8%, respectively.
comment: 23 pages, 9 figures, preprint; an extension of the paper
(arXiv:2306.02840)
☆ HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models
Xiao Wang, Jingyun Hua, Weihong Lin, Yuanxing Zhang, Fuzheng Zhang, Jianlong Wu, Di Zhang, Liqiang Nie
Recent Multi-modal Large Language Models (MLLMs) have made great progress in
video understanding. However, their performance on videos involving human
actions is still limited by the lack of high-quality data. To address this, we
introduce a two-stage data annotation pipeline. First, we design strategies to
accumulate videos featuring clear human actions from the Internet. Second,
videos are annotated in a standardized caption format that uses human
attributes to distinguish individuals and chronologically details their actions
and interactions. Through this pipeline, we curate two datasets, namely
HAICTrain and HAICBench. HAICTrain comprises 126K video-caption pairs
generated by Gemini-Pro and verified for training purposes. Meanwhile,
HAICBench includes 500 manually annotated video-caption pairs and
1,400 QA pairs, for a comprehensive evaluation of human action understanding.
Experimental results demonstrate that training with HAICTrain not only
significantly enhances human understanding abilities across 4 benchmarks, but
can also improve text-to-video generation results. Both the HAICTrain and
HAICBench are released at https://huggingface.co/datasets/KuaishouHAIC/HAIC.
☆ Plan2Align: Predictive Planning Based Test-Time Preference Alignment in Paragraph-Level Machine Translation
Kuang-Da Wang, Teng-Ruei Chen, Yu Heng Hung, Shuoyang Ding, Yueh-Hua Wu, Yu-Chiang Frank Wang, Chao-Han Huck Yang, Wen-Chih Peng, Ping-Chun Hsieh
Machine Translation (MT) has been predominantly designed for sentence-level
translation using transformer-based architectures. While next-token prediction
based Large Language Models (LLMs) demonstrate strong capabilities in long-text
translation, non-extensive language models often suffer from omissions and
semantic inconsistencies when processing paragraphs. Existing preference
alignment methods improve sentence-level translation but fail to ensure
coherence over extended contexts due to the myopic nature of next-token
generation. We introduce Plan2Align, a test-time alignment framework that
treats translation as a predictive planning problem, adapting Model Predictive
Control to iteratively refine translation outputs. Experiments on WMT24
Discourse-Level Literary Translation show that Plan2Align significantly
improves paragraph-level translation, achieving performance surpassing or on
par with the existing training-time and test-time alignment methods on
LLaMA-3.1 8B.
comment: Preprint. Code will be released at Plan2Align GitHub link:
https://github.com/NYCU-RL-Bandits-Lab/Plan2Align
☆ Chain-of-Thought Matters: Improving Long-Context Language Models with Reasoning Path Supervision
Dawei Zhu, Xiyu Wei, Guangxiang Zhao, Wenhao Wu, Haosheng Zou, Junfeng Ran, Xun Wang, Lin Sun, Xiangzheng Zhang, Sujian Li
Recent advances in Large Language Models (LLMs) have highlighted the
challenge of handling long-context tasks, where models need to reason over
extensive input contexts to aggregate target information. While
Chain-of-Thought (CoT) prompting has shown promise for multi-step reasoning,
its effectiveness for long-context scenarios remains underexplored. Through
systematic investigation across diverse tasks, we demonstrate that CoT's
benefits generalize across most long-context scenarios and amplify with
increasing context length. Motivated by this critical observation, we propose
LongRePS, a process-supervised framework that teaches models to generate
high-quality reasoning paths for enhanced long-context performance. Our
framework incorporates a self-sampling mechanism to bootstrap reasoning paths
and a novel quality assessment protocol specifically designed for long-context
scenarios. Experimental results on various long-context benchmarks demonstrate
the effectiveness of our approach, achieving significant improvements over
outcome supervision baselines on both in-domain tasks (+13.6/+3.8 points for
LLaMA/Qwen on MuSiQue) and cross-domain generalization (+9.3/+8.1 points on
average across diverse QA tasks). Our code, data and trained models are made
public to facilitate future research.
comment: 14 pages,6 figures
☆ GraphCheck: Multi-Path Fact-Checking with Entity-Relationship Graphs
Automated fact-checking aims to assess the truthfulness of text based on
relevant evidence, yet verifying complex claims requiring multi-hop reasoning
remains a significant challenge. We propose GraphCheck, a novel framework that
converts claims into entity-relationship graphs for comprehensive verification.
By identifying relations between explicit entities and latent entities across
multiple paths, GraphCheck enhances the adaptability and robustness of
verification. Furthermore, we introduce DP-GraphCheck, a two-stage variant that
improves performance by incorporating direct prompting as an initial filtering
step. Experiments on the HOVER and EX-FEVER datasets show that our approach
outperforms existing methods, particularly in multi-hop reasoning tasks.
Moreover, our two-stage framework generalizes well to other fact-checking
pipelines, demonstrating its versatility.
☆ MedHallTune: An Instruction-Tuning Benchmark for Mitigating Medical Hallucination in Vision-Language Models
The increasing use of vision-language models (VLMs) in healthcare
applications presents great challenges related to hallucinations, in which the
models may generate seemingly plausible results that are in fact incorrect.
Such hallucinations can jeopardize clinical decision making, potentially
harming diagnosis and treatment. In this work, we propose MedHallTune, a
large-scale benchmark designed specifically to evaluate and mitigate
hallucinations in medical VLMs. Comprising over 100,000 images and 1,000,000
instruction pairs, MedHallTune includes both hallucination and
non-hallucination samples, each with ground-truth annotations. We conduct a
comprehensive evaluation of current medical and general VLMs using MedHallTune,
assessing their performance across key metrics, including clinical accuracy,
relevance, detail level, and risk level. The experimental results show that
fine-tuning with MedHallTune successfully improves the ability of several
existing models to manage hallucinations and boost their zero-shot performance
on downstream visual-question-answering (VQA) tasks, making them more reliable
for practical medical applications. Our work contributes to the development of
more trustworthy VLMs. Codes and dataset will be available at
https://github.com/russellyq/MedHallTune.
☆ Triple Phase Transitions: Understanding the Learning Dynamics of Large Language Models from a Neuroscience Perspective
Large language models (LLMs) often exhibit abrupt emergent behavior, whereby
new abilities arise at certain points during their training. This phenomenon,
commonly referred to as a "phase transition", remains poorly understood. In
this study, we conduct an integrative analysis of such phase transitions by
examining three interconnected perspectives: the similarity between LLMs and
the human brain, the internal states of LLMs, and downstream task performance.
We propose a novel interpretation for the learning dynamics of LLMs that vary
in both training data and architecture, revealing that three phase transitions
commonly emerge across these models during training: (1) alignment with the
entire brain surges as LLMs begin adhering to task instructions (Brain
Alignment and Instruction Following), (2) unexpectedly, LLMs diverge from the
brain during a period in which downstream task accuracy temporarily stagnates
(Brain Detachment and Stagnation), and (3) alignment with the brain reoccurs as
LLMs become capable of solving the downstream tasks (Brain Realignment and
Consolidation). These findings illuminate the underlying mechanisms of phase
transitions in LLMs, while opening new avenues for interdisciplinary research
bridging AI and neuroscience.
comment: 46 pages
☆ FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference ICLR 2025
Large language models (LLMs) encounter computational challenges during
long-sequence inference, especially in the attention pre-filling phase, where
the complexity grows quadratically with the prompt length. Previous efforts to
mitigate these challenges have relied on fixed sparse attention patterns or
identifying sparse attention patterns based on limited cases. However, these
methods lacked the flexibility to efficiently adapt to varying input demands.
In this paper, we introduce FlexPrefill, a Flexible sparse Pre-filling
mechanism that dynamically adjusts sparse attention patterns and computational
budget in real-time to meet the specific requirements of each input and
attention head. The flexibility of our method is demonstrated through two key
innovations: 1) Query-Aware Sparse Pattern Determination: By measuring
Jensen-Shannon divergence, this component adaptively switches between
query-specific diverse attention patterns and predefined attention patterns. 2)
Cumulative-Attention Based Index Selection: This component dynamically selects
query-key indexes to be computed based on different attention patterns,
ensuring the sum of attention scores meets a predefined threshold. FlexPrefill
adaptively optimizes the sparse pattern and sparse ratio of each attention head
based on the prompt, enhancing efficiency in long-sequence inference tasks.
Experimental results show significant improvements in both speed and accuracy
over prior methods, providing a more flexible and efficient solution for LLM
inference.
comment: Accepted at ICLR 2025 (Oral)
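As a minimal illustration of the Cumulative-Attention Based Index Selection
described in the FlexPrefill abstract above, the NumPy sketch below keeps, for a
single query, the smallest set of keys whose softmax attention mass reaches a
threshold tau. Names and the threshold value are illustrative assumptions, not
the released implementation.

    import numpy as np

    def cumulative_attention_select(q, K, tau=0.95):
        # Keep the fewest key indexes whose attention mass reaches tau.
        scores = K @ q / np.sqrt(q.shape[-1])      # raw attention logits
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                       # softmax over all keys
        order = np.argsort(-probs)                 # highest weight first
        cum = np.cumsum(probs[order])
        keep = order[: int(np.searchsorted(cum, tau)) + 1]
        return np.sort(keep)                       # indexes to compute densely

    rng = np.random.default_rng(0)
    q, K = rng.normal(size=64), rng.normal(size=(1024, 64))
    print(len(cumulative_attention_select(q, K, tau=0.9)), "of 1024 keys kept")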
☆ Collective Reasoning Among LLMs: A Framework for Answer Validation Without Ground Truth
We present a collaborative framework where multiple large language models,
namely GPT-4-0125-preview, Meta-LLaMA-3-70B-Instruct, Claude-3-Opus, and
Gemini-1.5-Flash, work together to generate and respond to complex PhD-level
probability questions in the absence of definitive ground truth. This study
explores how inter-model consensus enhances response reliability and serves as
a proxy for assessing the quality of generated questions. To quantify agreement
and consistency, we employ statistical methods including chi-square tests,
Fleiss' Kappa, and confidence interval analysis, measuring both response
precision and question clarity. Our findings highlight that Claude and Gemini
generate well-structured and less ambiguous questions, leading to higher
inter-model agreement. This is reflected in their narrower confidence intervals
and stronger alignment with answering models. Conversely, LLaMA demonstrates
increased variability and lower reliability in question formulation, as
indicated by broader confidence intervals and reduced consensus rates. These
results suggest that multi-model collaboration not only enhances the
reliability of responses but also provides a valuable framework for assessing
and improving question quality in the absence of explicit ground truth. This
research offers meaningful insights into optimizing AI-driven reasoning through
collaborative large-language model interactions.
comment: 14 pages, 2 figures. arXiv admin note: substantial text overlap with
arXiv:2411.16797
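For readers unfamiliar with the agreement statistics cited above, the short
Python routine below computes Fleiss' kappa from a matrix of per-question answer
counts; the toy rating matrix is invented purely for illustration.

    import numpy as np

    def fleiss_kappa(counts):
        # counts[i, j] = number of models assigning answer option j to item i;
        # every row must sum to the same number of raters.
        counts = np.asarray(counts, dtype=float)
        n_items, n_raters = counts.shape[0], counts[0].sum()
        p_j = counts.sum(axis=0) / (n_items * n_raters)
        P_i = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
        P_bar, P_e = P_i.mean(), (p_j ** 2).sum()
        return (P_bar - P_e) / (1 - P_e)

    # 4 models answering 5 questions, each with 3 answer options (toy data).
    ratings = [[4, 0, 0], [2, 2, 0], [0, 4, 0], [1, 1, 2], [0, 0, 4]]
    print(f"Fleiss' kappa = {fleiss_kappa(ratings):.3f}")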
☆ The Rise of Darkness: Safety-Utility Trade-Offs in Role-Playing Dialogue Agents
Large Language Models (LLMs) have made remarkable advances in role-playing
dialogue agents, demonstrating their utility in character simulations. However,
it remains challenging for these agents to balance character portrayal utility
with content safety because this essential character simulation often comes
with the risk of generating unsafe content. To address this issue, we first
conduct a systematic exploration of the safety-utility trade-off across
multiple LLMs. Our analysis reveals that risk scenarios created by villain
characters and user queries (referred to as risk coupling) contribute to this
trade-off. Building on this, we propose a novel Adaptive Dynamic
Multi-Preference (ADMP) method, which dynamically adjusts safety-utility
preferences based on the degree of risk coupling and guides the model to
generate responses biased toward utility or safety. We further introduce
Coupling Margin Sampling (CMS) into coupling detection to enhance the model's
ability to handle high-risk scenarios. Experimental results demonstrate that
our approach improves safety metrics while maintaining utility.
☆ Acquiring Grounded Representations of Words with Situated Interactive Instruction
We present an approach for acquiring grounded representations of words from
mixed-initiative, situated interactions with a human instructor. The work
focuses on the acquisition of diverse types of knowledge including perceptual,
semantic, and procedural knowledge along with learning grounded meanings.
Interactive learning allows the agent to control its learning by requesting
instructions about unknown concepts, making learning efficient. Our approach
has been instantiated in Soar and has been evaluated on a table-top robotic arm
capable of manipulating small objects.
☆ Mitigating Hallucinations in Large Vision-Language Models by Adaptively Constraining Information Flow AAAI 2025
Large vision-language models show tremendous potential in understanding
visual information through human languages. However, they are prone to suffer
from object hallucination, i.e., the generated image descriptions contain
objects that do not exist in the image. In this paper, we reveal that object
hallucination can be attributed to overconfidence in irrelevant visual features
when soft visual tokens map to the LLM's word embedding space. Specifically, by
examining the semantic similarity between visual tokens and the LLM's word
embedding, we observe that the smoothness of similarity distribution strongly
correlates with the emergence of object hallucinations. To mitigate
hallucinations, we propose using the Variational Information Bottleneck (VIB)
to alleviate overconfidence by introducing stochastic noise, facilitating the
constraining of irrelevant information. Furthermore, we propose an
entropy-based noise-controlling strategy to enable the injected noise to be
adaptively constrained regarding the smoothness of the similarity distribution.
We adapt the proposed AdaVIB across distinct model architectures. Experimental
results demonstrate that the proposed AdaVIB mitigates object hallucinations by
effectively alleviating the overconfidence in irrelevant visual features, with
consistent improvements on two object hallucination benchmarks.
comment: Accepted to AAAI 2025. Camera ready version
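A schematic PyTorch sketch of the variational-information-bottleneck idea
summarized above: visual tokens are projected to a mean and log-variance,
Gaussian noise is injected via the reparameterization trick, and the noise is
scaled by the (normalized) entropy of the token-to-vocabulary similarity
distribution. The module, its dimensions, and the exact direction of the entropy
scaling are assumptions for illustration, not the authors' release.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VIBProjector(nn.Module):
        def __init__(self, vis_dim=32, llm_dim=64):
            super().__init__()
            self.mu = nn.Linear(vis_dim, llm_dim)
            self.logvar = nn.Linear(vis_dim, llm_dim)

        def forward(self, vis_tokens, word_embeddings):
            mu, logvar = self.mu(vis_tokens), self.logvar(vis_tokens)
            # Entropy of the similarity distribution over the vocabulary,
            # normalized to [0, 1], modulates how much noise is injected.
            sims = F.softmax(mu @ word_embeddings.t(), dim=-1)
            entropy = -(sims * sims.clamp_min(1e-9).log()).sum(-1, keepdim=True)
            scale = entropy / torch.log(torch.tensor(float(word_embeddings.size(0))))
            z = mu + scale * torch.randn_like(mu) * (0.5 * logvar).exp()
            kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
            return z, kl  # add beta * kl to the training objective

    proj = VIBProjector()
    z, kl = proj(torch.randn(1, 16, 32), torch.randn(1000, 64))
    print(z.shape, float(kl))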
☆ Teach-to-Reason with Scoring: Self-Explainable Rationale-Driven Multi-Trait Essay Scoring
Multi-trait automated essay scoring (AES) systems provide a fine-grained
evaluation of an essay's diverse aspects. While they excel in scoring, prior
systems fail to explain why specific trait scores are assigned. This lack of
transparency leaves instructors and learners unconvinced of the AES outputs,
hindering their practical use. To address this, we propose a self-explainable
Rationale-Driven Multi-trait automated Essay scoring (RaDME) framework. RaDME
leverages the reasoning capabilities of large language models (LLMs) by
distilling them into a smaller yet effective scorer. This more manageable
student model is optimized to sequentially generate a trait score followed by
the corresponding rationale, thereby inherently learning to select a more
justifiable score by considering the subsequent rationale during training. Our
findings indicate that while LLMs underperform in direct AES tasks, they excel
in rationale generation when provided with precise numerical scores. Thus,
RaDME integrates the superior reasoning capacities of LLMs into the robust
scoring accuracy of an optimized smaller model. Extensive experiments
demonstrate that RaDME achieves both accurate and adequate reasoning while
supporting high-quality multi-trait scoring, significantly enhancing the
transparency of AES.
☆ Structured Preference Optimization for Vision-Language Long-Horizon Task Planning
Xiwen Liang, Min Lin, Weiqi Ruan, Rongtao Xu, Yuecheng Liu, Jiaqi Chen, Bingqian Lin, Yuzheng Zhuang, Xiaodan Liang
Existing methods for vision-language task planning excel in short-horizon
tasks but often fall short in complex, long-horizon planning within dynamic
environments. These challenges primarily arise from the difficulty of
effectively training models to produce high-quality reasoning processes for
long-horizon tasks. To address this, we propose Structured Preference
Optimization (SPO), which aims to enhance reasoning and action selection in
long-horizon task planning through structured preference evaluation and
optimized training strategies. Specifically, SPO introduces: 1)
Preference-Based Scoring and Optimization, which systematically evaluates
reasoning chains based on task relevance, visual grounding, and historical
consistency; and 2) Curriculum-Guided Training, where the model progressively
adapts from simple to complex tasks, improving its generalization ability in
long-horizon scenarios and enhancing reasoning robustness. To advance research
in vision-language long-horizon task planning, we introduce ExtendaBench, a
comprehensive benchmark covering 1,509 tasks across VirtualHome and Habitat
2.0, categorized into ultra-short, short, medium, and long tasks. Experimental
results demonstrate that SPO significantly improves reasoning quality and final
decision accuracy, outperforming prior methods on long-horizon tasks and
underscoring the effectiveness of preference-driven optimization in
vision-language task planning. Specifically, SPO achieves a +5.98% GCR and
+4.68% SR improvement in VirtualHome and a +3.30% GCR and +2.11% SR improvement
in Habitat over the best-performing baselines.
comment: 18 pages
☆ Retrieval Backward Attention without Additional Training: Enhance Embeddings of Large Language Models via Repetition
Language models can be viewed as functions that embed text into Euclidean
space, where the quality of the embedding vectors directly determines model
performance; however, training such neural networks involves various uncertainties. This
paper focuses on improving the performance of pre-trained language models in
zero-shot settings through a simple and easily implementable method. We propose
a novel backward attention mechanism to enhance contextual information
encoding. Evaluated on the Chinese Massive Text Embedding Benchmark (C-MTEB),
our approach achieves significant improvements across multiple tasks, providing
valuable insights for advancing zero-shot learning capabilities.
☆ ProAI: Proactive Multi-Agent Conversational AI with Structured Knowledge Base for Psychiatric Diagnosis
Yuqi Wu, Guangya Wan, Jingjing Li, Shengming Zhao, Lingfeng Ma, Tianyi Ye, Ion Pop, Yanbo Zhang, Jie Chen
Most LLM-driven conversational AI systems operate reactively, responding to
user prompts without guiding the interaction. However, many real-world
applications, such as psychiatric diagnosis, consulting, and interviews,
require AI to take a proactive role,
asking the right questions and steering conversations toward specific
objectives. Using mental health differential diagnosis as an application
context, we introduce ProAI, a goal-oriented, proactive conversational AI
framework. ProAI integrates structured knowledge-guided memory, multi-agent
proactive reasoning, and a multi-faceted evaluation strategy, enabling LLMs to
engage in clinician-style diagnostic reasoning rather than simple response
generation. Through simulated patient interactions, user experience assessment,
and professional clinical validation, we demonstrate that ProAI achieves up to
83.3% accuracy in mental disorder differential diagnosis while maintaining
professional and empathetic interaction standards. These results highlight the
potential for more reliable, adaptive, and goal-driven AI diagnostic
assistants, advancing LLMs beyond reactive dialogue systems.
comment: 21 pages, 8 figures
☆ JAM: Controllable and Responsible Text Generation via Causal Reasoning and Latent Vector Manipulation
While large language models (LLMs) have made significant strides in
generating coherent and contextually relevant text, they often function as
opaque black boxes, trained on vast unlabeled datasets with statistical
objectives, lacking an interpretable framework for responsible control. In this
paper, we introduce JAM (Just A Move), a novel framework that interprets and
controls text generation by integrating cause-effect analysis within the latent
space of LLMs. Based on our observations, we uncover the inherent causality in
LLM generation, which is critical for producing responsible and realistic
outputs. Moreover, we explore latent vectors as fundamental components in LLM
architectures, aiming to understand and manipulate them for more effective and
efficient controllable text generation. We evaluate our framework using a range
of tools, including the HHH criteria, toxicity reduction benchmarks, and GPT-4
alignment measures. Our results show that JAM achieves up to a 22% improvement
over previous Controllable Text Generation (CTG) methods across multiple
quantitative metrics and human-centric evaluations. Furthermore, JAM
demonstrates greater computational efficiency compared to other CTG methods.
These results highlight the effectiveness and efficiency of JAM for responsible
and realistic text generation, paving the way for more interpretable and
controllable models.
comment: 10 pages, 3 figures, and 6 tables
☆ Fine-tuning BERT with Bidirectional LSTM for Fine-grained Movie Reviews Sentiment Analysis
Sentiment Analysis (SA) is instrumental in understanding people's viewpoints,
facilitating social media monitoring, recognizing products and brands, and
gauging customer satisfaction. Consequently, SA has evolved into an active
research domain within Natural Language Processing (NLP). Many approaches
outlined in the literature devise intricate frameworks aimed at achieving high
accuracy, focusing exclusively on either binary sentiment classification or
fine-grained sentiment classification. In this paper, our objective is to
fine-tune the pre-trained BERT model with a Bidirectional LSTM (BiLSTM) to
enhance both binary and fine-grained SA specifically for movie reviews. Our
approach involves conducting sentiment classification for each review, followed
by computing the overall sentiment polarity across all reviews. We present our
findings on binary classification as well as fine-grained classification
utilizing benchmark datasets. Additionally, we implement and assess two
accuracy-improvement techniques, the Synthetic Minority Oversampling Technique
(SMOTE) and NLP Augmenter (NLPAUG), to bolster the model's generalization in
fine-grained sentiment classification. Finally, a heuristic algorithm is
employed to calculate the overall polarity of predicted reviews from the
BERT+BiLSTM output vector. Our approach performs comparably with
state-of-the-art (SOTA) techniques in both classifications. For instance, in
binary classification we achieve 97.67% accuracy, surpassing the leading SOTA
model NB-weighted-BON+dv-cosine by 0.27% on the renowned IMDb dataset.
Conversely, for five-class classification on SST-5, while the top SOTA model
RoBERTa+large+Self-explaining attains 55.5% accuracy, our model achieves 59.48%
accuracy, surpassing the BERT-large baseline by 3.6%.
comment: 14 pages, 5 figures, published in International Journal On Advances
in Systems and Measurements, volume 16, numbers 3 and 4, 2023
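A compact PyTorch/Transformers sketch of the BERT+BiLSTM architecture the
abstract describes: BERT token representations feed a bidirectional LSTM whose
final hidden states are classified into sentiment labels. The model name,
hidden sizes, and five-class head are illustrative defaults, not the paper's
exact configuration.

    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    class BertBiLSTM(nn.Module):
        def __init__(self, num_labels=5, lstm_hidden=256, name="bert-base-uncased"):
            super().__init__()
            self.bert = AutoModel.from_pretrained(name)
            self.lstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                                batch_first=True, bidirectional=True)
            self.head = nn.Linear(2 * lstm_hidden, num_labels)

        def forward(self, input_ids, attention_mask):
            hidden = self.bert(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
            _, (h_n, _) = self.lstm(hidden)               # h_n: (2, batch, hidden)
            return self.head(torch.cat([h_n[0], h_n[1]], dim=-1))

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    batch = tok(["A beautifully acted, quietly devastating film."],
                return_tensors="pt", padding=True, truncation=True)
    logits = BertBiLSTM()(batch["input_ids"], batch["attention_mask"])
    print(logits.shape)  # (1, 5) for SST-5-style fine-grained labels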
☆ Disentangling Feature Structure: A Mathematically Provable Two-Stage Training Dynamics in Transformers
Transformers may exhibit two-stage training dynamics during the real-world
training process. For instance, when training GPT-2 on the Counterfact dataset,
the answers progress from syntactically incorrect to syntactically correct to
semantically correct. However, existing theoretical analyses hardly account for
this two-stage phenomenon. In this paper, we theoretically demonstrate how such
two-stage training dynamics occur in transformers. Specifically, we analyze the
dynamics of transformers using feature learning techniques under in-context
learning regimes, based on a disentangled two-type feature structure. Such
disentanglement of feature structure is general in practice, e.g., natural
languages contain syntax and semantics, and proteins contain primary and
secondary structures. To the best of our knowledge, this is the first rigorous result
regarding a two-stage optimization process in transformers. Additionally, a
corollary indicates that such a two-stage process is closely related to the
spectral properties of the attention weights, which accords well with empirical
findings.
☆ Prediction of Item Difficulty for Reading Comprehension Items by Creation of Annotated Item Repository
Prediction of item difficulty based on its text content is of substantial
interest. In this paper, we focus on the related problem of recovering
IRT-based difficulty when the data originally reported only the item p-value
(percent of correct responses). We model this item difficulty using a repository of reading
passages and student data from US standardized tests from New York and Texas
for grades 3-8 spanning the years 2017-23. This repository is annotated with
meta-data on (1) linguistic features of the reading items, (2) test features of
the passage, and (3) context features. A penalized regression prediction model
with all these features can predict item difficulty with RMSE 0.52 compared to
baseline RMSE of 0.92, and with a correlation of 0.77 between true and
predicted difficulty. We supplement these features with embeddings from LLMs
(ModernBERT, BERT, and LLaMA), which marginally improve item difficulty
prediction. When models use only item linguistic features or LLM embeddings,
prediction performance is similar, which suggests that only one of these
feature categories may be required. This item difficulty prediction model can
be used to filter and categorize reading items and will be made publicly
available for use by other stakeholders.
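As a rough sketch of the penalized-regression setup described above, the
scikit-learn snippet below fits a ridge model on an item-by-feature matrix
(linguistic, test, and context features, optionally concatenated with embedding
dimensions) and compares its RMSE to a mean-prediction baseline. The data here
are synthetic placeholders, since the annotated repository is not included.

    import numpy as np
    from sklearn.linear_model import RidgeCV
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 40))                    # item features (placeholder)
    y = 0.3 * X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=500)  # difficulty

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X_tr, y_tr)
    rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    base = mean_squared_error(y_te, np.full_like(y_te, y_tr.mean())) ** 0.5
    print(f"model RMSE {rmse:.2f} vs. mean-baseline RMSE {base:.2f}")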
☆ Automatic database description generation for Text-to-SQL
In the context of the Text-to-SQL task, table and column descriptions are
crucial for bridging the gap between natural language and database schema. This
report proposes a method for automatically generating effective database
descriptions when explicit descriptions are unavailable. The proposed method
employs a dual-process approach: a coarse-to-fine process, followed by a
fine-to-coarse process. The coarse-to-fine approach leverages the inherent
knowledge of LLM to guide the understanding process from databases to tables
and finally to columns. This approach provides a holistic understanding of the
database structure and ensures contextual alignment. Conversely, the
fine-to-coarse approach starts at the column level, offering a more accurate
and nuanced understanding when stepping back to the table level. Experimental
results on the Bird benchmark indicate that using descriptions generated by the
proposed method improves SQL generation accuracy by 0.93\% compared to not using
descriptions, and achieves 37\% of human-level performance. The source code is
publicly available at https://github.com/XGenerationLab/XiYan-DBDescGen.
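A minimal sketch of the dual-process prompting described above, assuming a
generic llm(prompt) helper (a hypothetical stand-in, not the released code): a
coarse-to-fine pass drafts table and column descriptions from database-level
context, and a fine-to-coarse pass summarizes the column descriptions back into
a revised table description.

    def llm(prompt: str) -> str:
        raise NotImplementedError("plug in any chat-completion client here")

    def describe_schema(db_name, tables):
        # tables: {table_name: [column_name, ...]}
        out = {}
        for table, columns in tables.items():
            # Coarse-to-fine: database -> table -> columns.
            table_hint = llm(f"Database '{db_name}' has table '{table}' with "
                             f"columns {columns}. Briefly describe its purpose.")
            col_desc = {c: llm(f"In table '{table}' ({table_hint}), describe "
                               f"column '{c}' in one sentence.") for c in columns}
            # Fine-to-coarse: refined column views -> final table description.
            table_desc = llm("Summarize this table for a Text-to-SQL system, "
                             f"given its column descriptions: {col_desc}")
            out[table] = {"table": table_desc, "columns": col_desc}
        return out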
☆ Consistency Evaluation of News Article Summaries Generated by Large (and Small) Language Models
Text summarization is a critical Natural Language Processing (NLP) task with
applications ranging from information retrieval to content generation. Large
Language Models (LLMs) have shown remarkable promise in generating fluent
abstractive summaries, but they can produce hallucinated details not grounded in
the source text. Regardless of the method used to generate a summary, high-quality
automated evaluations remain an open area of investigation. This paper embarks
on an exploration of text summarization with a diverse set of techniques,
including TextRank, BART, Mistral-7B-Instruct, and OpenAI GPT-3.5-Turbo. The
generated summaries are evaluated using traditional metrics such as the
Recall-Oriented Understudy for Gisting Evaluation (ROUGE) Score and
Bidirectional Encoder Representations from Transformers (BERT) Score, as well
as LLM-powered evaluation methods that directly assess a generated summary's
consistency with the source text. We introduce a meta evaluation score which
directly assesses the performance of the LLM evaluation system (prompt +
model). We find that that all summarization models produce consistent summaries
when tested on the XL-Sum dataset, exceeding the consistency of the reference
summaries.
comment: 21 pages, 6 figures, 4 tables
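The traditional metrics mentioned above take only a few lines to compute; the
sketch below scores a candidate summary with ROUGE and BERTScore and shows the
shape of an LLM-as-judge consistency prompt (the judge call itself is left as a
stub). The rouge-score and bert-score packages are assumed to be installed.

    from rouge_score import rouge_scorer
    from bert_score import score as bert_score

    source = "The central bank raised interest rates by 25 basis points on Tuesday."
    summary = "The central bank increased rates by a quarter point."

    rouge = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    print(rouge.score(source, summary))

    P, R, F1 = bert_score([summary], [source], lang="en")
    print(f"BERTScore F1: {F1.item():.3f}")

    judge_prompt = ("Source text:\n" + source + "\n\nSummary:\n" + summary +
                    "\n\nDoes the summary contain any claim not supported by the "
                    "source? Answer 'consistent' or 'inconsistent', then explain.")
    # response = judge_llm(judge_prompt)  # hypothetical call; compare to labels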
☆ LexRAG: Benchmarking Retrieval-Augmented Generation in Multi-Turn Legal Consultation Conversation
Haitao Li, Yifan Chen, Yiran Hu, Qingyao Ai, Junjie Chen, Xiaoyu Yang, Jianhui Yang, Yueyue Wu, Zeyang Liu, Yiqun Liu
Retrieval-augmented generation (RAG) has proven highly effective in improving
large language models (LLMs) across various domains. However, there is no
benchmark specifically designed to assess the effectiveness of RAG in the legal
domain, which restricts progress in this area. To fill this gap, we propose
LexRAG, the first benchmark to evaluate RAG systems for multi-turn legal
consultations. LexRAG consists of 1,013 multi-turn dialogue samples and 17,228
candidate legal articles. Each sample is annotated by legal experts and
consists of five rounds of progressive questioning. LexRAG includes two key
tasks: (1) Conversational knowledge retrieval, requiring accurate retrieval of
relevant legal articles based on multi-turn context. (2) Response generation,
focusing on producing legally sound answers. To ensure reliable
reproducibility, we develop LexiT, a legal RAG toolkit that provides a
comprehensive implementation of RAG system components tailored for the legal
domain. Additionally, we introduce an LLM-as-a-judge evaluation pipeline to
enable detailed and effective assessment. Through experimental analysis of
various LLMs and retrieval methods, we reveal the key limitations of existing
RAG systems in handling legal consultation conversations. LexRAG establishes a
new benchmark for the practical application of RAG systems in the legal domain,
with its code and data available at https://github.com/CSHaitao/LexRAG.
comment: 10 pages
☆ Rectifying Belief Space via Unlearning to Harness LLMs' Reasoning
Large language models (LLMs) can exhibit advanced reasoning yet still
generate incorrect answers. We hypothesize that such errors frequently stem
from spurious beliefs: propositions that the model internally considers true
but that are in fact incorrect. To address this, we propose a method to rectify the belief space by
suppressing these spurious beliefs while simultaneously enhancing true ones,
thereby enabling more reliable inferences. Our approach first identifies the
beliefs that lead to incorrect or correct answers by prompting the model to
generate textual explanations, using our Forward-Backward Beam Search (FBBS).
We then apply unlearning to suppress the identified spurious beliefs and
enhance the true ones, effectively rectifying the model's belief space.
Empirical results on multiple QA datasets and LLMs show that our method
corrects previously misanswered questions without harming overall model
performance. Furthermore, our approach yields improved generalization on unseen
data, suggesting that rectifying a model's belief space is a promising
direction for mitigating errors and enhancing overall reliability.
☆ Continuous Adversarial Text Representation Learning for Affective Recognition
While pre-trained language models excel at semantic understanding, they often
struggle to capture nuanced affective information critical for affective
recognition tasks. To address these limitations, we propose a novel framework
for enhancing emotion-aware embeddings in transformer-based models. Our
approach introduces a continuous valence-arousal labeling system to guide
contrastive learning, which captures subtle and multi-dimensional emotional
nuances more effectively. Furthermore, we employ a dynamic token perturbation
mechanism, using gradient-based saliency to focus on sentiment-relevant tokens,
improving model sensitivity to emotional cues. The experimental results
demonstrate that the proposed framework outperforms existing methods, achieving
up to 15.5% improvement in the emotion classification benchmark, highlighting
the importance of employing continuous labels. This improvement demonstrates
that the proposed framework is effective in affective representation learning
and enables precise and contextually relevant emotional understanding.
comment: 6 pages, 3 figures, The 7th International Conference on Artificial
Intelligence in Information and Communication (ICAIIC 2025)
☆ Leveraging Large Language Models for Building Interpretable Rule-Based Data-to-Text Systems
We introduce a simple approach that uses a large language model (LLM) to
automatically implement a fully interpretable rule-based data-to-text system in
pure Python. Experimental evaluation on the WebNLG dataset showed that a system
constructed in this way produces text of better quality (according to the BLEU and
BLEURT metrics) than the same LLM prompted to directly produce outputs, and
produces fewer hallucinations than a BART language model fine-tuned on the same
data. Furthermore, at runtime, the approach generates text in a fraction of the
processing time required by neural approaches, using only a single CPU.
☆ NutriGen: Personalized Meal Plan Generator Leveraging Large Language Models to Enhance Dietary and Nutritional Adherence
Maintaining a balanced diet is essential for overall health, yet many
individuals struggle with meal planning due to nutritional complexity, time
constraints, and lack of dietary knowledge. Personalized food recommendations
can help address these challenges by tailoring meal plans to individual
preferences, habits, and dietary restrictions. However, existing dietary
recommendation systems often lack adaptability, fail to consider real-world
constraints such as food ingredient availability, and require extensive user
input, making them impractical for sustainable and scalable daily use. To
address these limitations, we introduce NutriGen, a framework based on large
language models (LLMs) designed to generate personalized meal plans that align
with user-defined dietary preferences and constraints. By building a
personalized nutrition database and leveraging prompt engineering, our approach
enables LLMs to incorporate reliable nutritional references like the USDA
nutrition database while maintaining flexibility and ease-of-use. We
demonstrate that LLMs have strong potential in generating accurate and
user-friendly food recommendations, addressing key limitations in existing
dietary recommendation systems by providing structured, practical, and scalable
meal plans. Our evaluation shows that Llama 3.1 8B and GPT-3.5 Turbo achieve
the lowest percentage errors of 1.55\% and 3.68\%, respectively, producing meal
plans that closely align with user-defined caloric targets while minimizing
deviation and improving precision. Additionally, we compared the performance of
DeepSeek V3 against several established models to evaluate its potential in
personalized nutrition planning.
♻ ☆ The GUS Framework: Benchmarking Social Bias Classification with Discriminative (Encoder-Only) and Generative (Decoder-Only) Language Models
The detection of social bias in text is a critical challenge, particularly
due to the limitations of binary classification methods. These methods often
oversimplify nuanced biases, leading to high emotional impact when content is
misclassified as either "biased" or "fair." To address these shortcomings, we
propose a more nuanced framework that focuses on three key linguistic
components underlying social bias: Generalizations, Unfairness, and Stereotypes
(the GUS framework). The GUS framework employs a semi-automated approach to
create a comprehensive synthetic dataset, which is then verified by humans to
maintain ethical standards. This dataset enables robust multi-label token
classification. Our methodology, which combines discriminative (encoder-only)
models and generative (auto-regressive) large language models, identifies
biased entities in text. Through extensive experiments, we demonstrate that
encoder-only models are effective for this complex task, often outperforming
state-of-the-art methods, both in terms of macro and entity-wise F1-score and
Hamming loss. These findings can guide the choice of model for different use
cases, highlighting the GUS framework's effectiveness in capturing explicit and
implicit biases across diverse contexts, and offering a pathway for future
research and applications in various fields.
♻ ☆ Can Large Language Models Predict the Outcome of Judicial Decisions?
Large Language Models (LLMs) have shown exceptional capabilities in Natural
Language Processing (NLP) across diverse domains. However, their application in
specialized tasks such as Legal Judgment Prediction (LJP) for low-resource
languages like Arabic remains underexplored. In this work, we address this gap
by developing an Arabic LJP dataset, collected and preprocessed from Saudi
commercial court judgments. We benchmark state-of-the-art open-source LLMs,
including LLaMA-3.2-3B and LLaMA-3.1-8B, under varying configurations such as
zero-shot, one-shot, and fine-tuning using LoRA. Additionally, we employed a
comprehensive evaluation framework that integrates both quantitative metrics
(such as BLEU, ROUGE, and BERT) and qualitative assessments (including
Coherence, Legal Language, Clarity, etc.) using an LLM. Our results demonstrate
that fine-tuned smaller models achieve comparable performance to larger models
in task-specific contexts while offering significant resource efficiency.
Furthermore, we investigate the impact of fine-tuning the model on a diverse
set of instructions, offering valuable insights into the development of a more
human-centric and adaptable LLM. We have made the dataset, code, and models
publicly available to provide a solid foundation for future research in Arabic
legal NLP.
♻ ☆ Explaining Humour Style Classifications: An XAI Approach to Understanding Computational Humour Analysis
Humour styles can have either a negative or a positive impact on well-being.
Given the importance of these styles to mental health, significant research has
been conducted on their automatic identification. However, the automated
machine learning models used for this purpose are black boxes, making their
prediction decisions opaque. Clarity and transparency are vital in the field of
mental health. This paper presents an explainable AI (XAI) framework for
understanding humour style classification, building upon previous work in
computational humour analysis. Using the best-performing single model
(ALI+XGBoost) from prior research, we apply comprehensive XAI techniques to
analyse how linguistic, emotional, and semantic features contribute to humour
style classification decisions. Our analysis reveals distinct patterns in how
different humour styles are characterised and misclassified, with particular
emphasis on the challenges in distinguishing affiliative humour from other
styles. Through detailed examination of feature importance, error patterns, and
misclassification cases, we identify key factors influencing model decisions,
including emotional ambiguity, context misinterpretation, and target
identification. The framework demonstrates significant utility in understanding
model behaviour, achieving interpretable insights into the complex interplay of
features that define different humour styles. Our findings contribute to both
the theoretical understanding of computational humour analysis and practical
applications in mental health, content moderation, and digital humanities
research.
♻ ☆ Logicbreaks: A Framework for Understanding Subversion of Rule-based Inference
We study how to subvert large language models (LLMs) from following
prompt-specified rules. We first formalize rule-following as inference in
propositional Horn logic, a mathematical system in which rules have the form
"if $P$ and $Q$, then $R$" for some propositions $P$, $Q$, and $R$. Next, we
prove that although small transformers can faithfully follow such rules,
maliciously crafted prompts can still mislead both theoretical constructions
and models learned from data. Furthermore, we demonstrate that popular attack
algorithms on LLMs find adversarial prompts and induce attention patterns that
align with our theory. Our novel logic-based framework provides a foundation
for studying LLMs in rule-based settings, enabling a formal analysis of tasks
like logical reasoning and jailbreak attacks.
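To make the rule-following setting concrete, the snippet below is a plain
forward-chaining evaluator for propositional Horn rules of the form "if P and
Q, then R"; it is a reference implementation of the logic itself, not the
paper's transformer construction or attack.

    def forward_chain(facts, rules, max_steps=100):
        # facts: set of true propositions; rules: list of (premises, conclusion).
        known = set(facts)
        for _ in range(max_steps):
            new = {concl for prem, concl in rules
                   if set(prem) <= known and concl not in known}
            if not new:
                break
            known |= new
        return known

    rules = [(("P", "Q"), "R"), (("R",), "S")]
    print(forward_chain({"P", "Q"}, rules))  # {'P', 'Q', 'R', 'S'}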
♻ ☆ Logical Consistency of Large Language Models in Fact-checking ICLR 2025
In recent years, large language models (LLMs) have demonstrated significant
success in performing varied natural language tasks such as language
translation, question-answering, summarizing, fact-checking, etc. Despite LLMs'
impressive ability to generate human-like texts, LLMs are infamous for their
inconsistent responses: a meaning-preserving change in the input query can
result in an inconsistent response, a behavior attributable to vulnerabilities
of LLMs such as hallucination. Consequently, existing research focuses on simple
paraphrasing-based consistency assessment of LLMs, and ignores complex queries
that necessitate an even better understanding of logical reasoning by an LLM.
Our work therefore addresses the logical inconsistency of LLMs under complex
logical queries with primitive logical operators, e.g., negation, conjunction,
and disjunction. As a test bed, we consider retrieval-augmented LLMs on a
fact-checking task involving propositional logic queries from knowledge graphs
(KGs). Our contributions are threefold. Benchmark: We introduce three logical
fact-checking datasets over KGs for community development towards logically
consistent LLMs. Assessment: We propose consistency measures of LLMs on
propositional logic queries and demonstrate that existing LLMs lack logical
consistency, especially on complex queries. Improvement: We employ supervised
fine-tuning to improve the logical consistency of LLMs on the complex
fact-checking task with KG contexts. We have made our source code and
benchmarks available.
comment: Published at ICLR 2025
♻ ☆ Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA ICLR 2025
Large language models (LLMs) are expensive to deploy. Parameter sharing
offers a possible path towards reducing their size and cost, but its
effectiveness in modern LLMs remains fairly limited. In this work, we revisit
"layer tying" as form of parameter sharing in Transformers, and introduce novel
methods for converting existing LLMs into smaller "Recursive Transformers" that
share parameters across layers, with minimal loss of performance. Here, our
Recursive Transformers are efficiently initialized from standard pretrained
Transformers, but only use a single block of unique layers that is then
repeated multiple times in a loop. We further improve performance by
introducing Relaxed Recursive Transformers that add flexibility to the layer
tying constraint via depth-wise low-rank adaptation (LoRA) modules, yet still
preserve the compactness of the overall model. We show that our recursive
models (e.g., recursive Gemma 1B) outperform both similar-sized vanilla
pretrained models (such as TinyLlama 1.1B and Pythia 1B) and knowledge
distillation baselines -- and can even recover most of the performance of the
original "full-size" model (e.g., Gemma 2B with no shared parameters). Finally,
we propose Continuous Depth-wise Batching, a promising new inference paradigm
enabled by the Recursive Transformer when paired with early exiting. In a
theoretical analysis, we show that this has the potential to lead to
significant (2-3x) gains in inference throughput.
comment: ICLR 2025; 49 pages, 17 figures, 19 tables
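A schematic PyTorch sketch of the looped-shared-block idea with depth-wise LoRA
relaxation described above; dimensions, loop count, and where the LoRA term is
attached are simplifying assumptions, not the paper's architecture.

    import torch
    import torch.nn as nn

    class LoRA(nn.Module):
        # Low-rank residual adapter specific to one depth of the loop.
        def __init__(self, dim, rank=8):
            super().__init__()
            self.down = nn.Linear(dim, rank, bias=False)
            self.up = nn.Linear(rank, dim, bias=False)
            nn.init.zeros_(self.up.weight)        # starts as an exact no-op

        def forward(self, x):
            return self.up(self.down(x))

    class RelaxedRecursiveBlock(nn.Module):
        def __init__(self, dim=512, n_loops=4):
            super().__init__()
            self.shared = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                                     batch_first=True)
            self.depth_lora = nn.ModuleList(LoRA(dim) for _ in range(n_loops))

        def forward(self, x):
            for lora in self.depth_lora:      # same weights reused at every depth,
                x = self.shared(x) + lora(x)  # relaxed by a depth-specific term
            return x

    print(RelaxedRecursiveBlock()(torch.randn(2, 16, 512)).shape)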
♻ ☆ Atomas: Hierarchical Alignment on Molecule-Text for Unified Molecule Understanding and Generation
Molecule-and-text cross-modal representation learning has emerged as a
promising direction for enhancing the quality of molecular representation,
thereby improving performance in various scientific fields, including drug
discovery and materials science. Existing studies adopt a global alignment
approach to learn the knowledge from different modalities. These global
alignment approaches fail to capture fine-grained information, such as
molecular fragments and their corresponding textual description, which is
crucial for downstream tasks. Furthermore, such fine-grained information cannot
be modeled with a similar global alignment strategy, owing to the scarcity of
paired, locally annotated data in existing datasets. In this paper, we
propose Atomas, a multi-modal molecular representation learning framework to
jointly learn representations from SMILES string and text. We design a
Hierarchical Adaptive Alignment model to concurrently learn the fine-grained
fragment correspondence between two modalities and align these representations
of fragments at three levels. Additionally, Atomas's end-to-end training
framework incorporates the tasks of understanding and generating molecules,
thereby supporting a wider range of downstream tasks. In the retrieval task,
Atomas exhibits robust generalization ability and outperforms the baseline by
30.8% of recall@1 on average. In the generation task, Atomas achieves
state-of-the-art results in both molecule captioning task and molecule
generation task. Moreover, the visualization of the Hierarchical Adaptive
Alignment model further confirms the chemical significance of our approach. Our
code can be found at https://anonymous.4open.science/r/Atomas-03C3.
♻ ☆ You Only Prune Once: Designing Calibration-Free Model Compression With Policy Learning
The ever-increasing size of large language models (LLMs) presents significant
challenges for deployment due to their heavy computational and memory
requirements. Current model pruning techniques attempt to alleviate these
issues by relying heavily on external calibration datasets to determine which
parameters to prune or compress, thus limiting their flexibility and
scalability across different compression ratios. Moreover, these methods often
cause severe performance degradation, particularly in downstream tasks, when
subjected to higher compression rates. In this paper, we propose PruneNet, a
novel model compression method that addresses these limitations by
reformulating model pruning as a policy learning process. PruneNet decouples
the pruning process from the model architecture, eliminating the need for
calibration datasets. It learns a stochastic pruning policy to assess parameter
importance solely based on intrinsic model properties while preserving the
spectral structure to minimize information loss. PruneNet can compress the
LLaMA-2-7B model in just 15 minutes, achieving over 80% retention of its
zero-shot performance with a 30% compression ratio, outperforming existing
methods that retain only 75% performance. Furthermore, on complex multitask
language understanding tasks, PruneNet demonstrates its robustness by
preserving up to 80% performance of the original model, proving itself a
superior alternative to conventional structured compression techniques.
♻ ☆ CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery ICLR 2025
Xiaoshuai Song, Muxi Diao, Guanting Dong, Zhengyang Wang, Yujia Fu, Runqi Qiao, Zhexu Wang, Dayuan Fu, Huangxuan Wu, Bin Liang, Weihao Zeng, Yejie Wang, Zhuoma GongQue, Jianing Yu, Qiuna Tan, Weiran Xu
Large language models (LLMs) have demonstrated significant potential in
advancing various fields of research and society. However, the current
community of LLMs overly focuses on benchmarks for analyzing specific
foundational skills (e.g. mathematics and code generation), neglecting an
all-round evaluation of the computer science field. To bridge this gap, we
introduce CS-Bench, the first multilingual (English, Chinese, French, German)
benchmark dedicated to evaluating the performance of LLMs in computer science.
CS-Bench comprises approximately 10K meticulously curated test samples,
covering 26 subfields across 4 key areas of computer science, encompassing
various task forms and divisions of knowledge and reasoning. Utilizing
CS-Bench, we conduct a comprehensive evaluation of over 30 mainstream LLMs,
revealing the relationship between CS performance and model scales. We also
quantitatively analyze the reasons for failures in existing LLMs and highlight
directions for improvements, including knowledge supplementation and
CS-specific reasoning. Further cross-capability experiments show a high
correlation between LLMs' capabilities in computer science and their abilities
in mathematics and coding. Moreover, expert LLMs specialized in mathematics and
coding also demonstrate strong performances in several CS subfields. Looking
ahead, we envision CS-Bench serving as a cornerstone for LLM applications in
the CS field and paving new avenues in assessing LLMs' diverse reasoning
capabilities. The CS-Bench data and evaluation code are available at
https://github.com/csbench/csbench.
comment: Accepted at ICLR 2025
♻ ☆ SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training
Large Language Models (LLMs) have demonstrated exceptional performance across
diverse tasks, yet their training remains highly resource-intensive and
susceptible to critical challenges such as training instability. A predominant
source of this instability stems from gradient and loss spikes, which disrupt
the learning process, often leading to costly interventions like checkpoint
recovery and experiment restarts, further amplifying inefficiencies. This paper
presents a comprehensive investigation into gradient spikes observed during LLM
training, revealing their prevalence across multiple architectures and
datasets. Our analysis shows that these spikes can be up to $1000\times$ larger
than typical gradients, substantially deteriorating model performance. To
address this issue, we propose Spike-Aware Adam with Momentum Reset (SPAM), a
novel optimizer designed to counteract gradient spikes through momentum reset
and spike-aware gradient clipping. Extensive experiments, including both
pre-training and fine-tuning, demonstrate that SPAM consistently surpasses Adam
and its variants across various tasks, including (1) LLM pre-training from 60M
to 1B, (2) 4-bit LLM pre-training, (3) reinforcement learning, and (4) Time
Series Forecasting. Additionally, SPAM facilitates memory-efficient training by
enabling sparse momentum, where only a subset of momentum terms are maintained
and updated. When operating under memory constraints, SPAM outperforms
state-of-the-art memory-efficient optimizers such as GaLore and Adam-Mini. Our
work underscores the importance of mitigating gradient spikes in LLM training
and introduces an effective optimization strategy that enhances both training
stability and resource efficiency at scale. Code is available at
https://github.com/TianjinYellow/SPAM-Optimizer.git
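A toy sketch of the two mechanisms named above, spike-aware gradient clipping
and periodic momentum reset, grafted onto a plain Adam-style update; the
thresholds and reset schedule are illustrative and this is not the released
SPAM optimizer.

    import torch

    def spam_like_step(param, m, v, step, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                       spike_factor=50.0, reset_every=500):
        g = param.grad
        if step > 1:  # clip spikes far above the running second-moment scale
            scale = (v.sqrt() + eps) * spike_factor
            g = torch.where(g.abs() > scale, g.sign() * scale, g)
        if step % reset_every == 0:  # stop a spike from contaminating momentum
            m.zero_(); v.zero_()
        m.mul_(betas[0]).add_(g, alpha=1 - betas[0])
        v.mul_(betas[1]).addcmul_(g, g, value=1 - betas[1])
        m_hat = m / (1 - betas[0] ** step)
        v_hat = v / (1 - betas[1] ** step)
        param.data.add_(-lr * m_hat / (v_hat.sqrt() + eps))

    p = torch.nn.Parameter(torch.randn(10))
    m, v = torch.zeros_like(p), torch.zeros_like(p)
    (p ** 2).sum().backward()
    spam_like_step(p, m, v, step=1)
    print(p.detach()[:3])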
♻ ☆ GOAT-Bench: Safety Insights to Large Multimodal Models through Meme-Based Social Abuse
The exponential growth of social media has profoundly transformed how
information is created, disseminated, and absorbed, exceeding any precedent in
the digital age. Regrettably, this explosion has also spawned a significant
increase in the online abuse of memes. Evaluating the negative impact of memes
is notably challenging, owing to their often subtle and implicit meanings,
which are not directly conveyed through the overt text and image. In light of
this, large multimodal models (LMMs) have emerged as a focal point of interest
due to their remarkable capabilities in handling diverse multimodal tasks. In
response to this development, our paper aims to thoroughly examine the capacity
of various LMMs (e.g., GPT-4o) to discern and respond to the nuanced aspects of
social abuse manifested in memes. We introduce the comprehensive meme
benchmark, GOAT-Bench, comprising over 6K varied memes encapsulating themes
such as implicit hate speech, sexism, and cyberbullying. Utilizing
GOAT-Bench, we delve into the ability of LMMs to accurately assess hatefulness,
misogyny, offensiveness, sarcasm, and harmful content. Our extensive
experiments across a range of LMMs reveal that current models still exhibit a
deficiency in safety awareness, showing insensitivity to various forms of
implicit abuse. We posit that this shortfall represents a critical impediment
to the realization of safe artificial intelligence. The GOAT-Bench and
accompanying resources are publicly accessible at https://goatlmm.github.io/,
contributing to ongoing research in this vital field.
comment: The first work to benchmark Large Multimodal Models in safety insight
on social media
♻ ☆ AdEval: Alignment-based Dynamic Evaluation to Mitigate Data Contamination in Large Language Models
As Large Language Models (LLMs) are pretrained on massive-scale corpora, the
issue of data contamination has become increasingly severe, leading to
potential overestimation of model performance during evaluation. To address
this, we propose AdEval (Alignment-based Dynamic Evaluation), a dynamic data
evaluation method aimed at mitigating the impact of data contamination on
evaluation reliability. AdEval extracts key knowledge points and main ideas to
align dynamically generated questions with static data's core concepts. It also
leverages online search to provide detailed explanations of related knowledge
points, thereby creating high-quality evaluation samples with robust knowledge
support. Furthermore, AdEval incorporates mechanisms to control the number and
complexity of questions, enabling dynamic alignment and flexible adjustment.
This ensures that the generated questions align with the complexity of static
data while supporting varied complexity levels. Based on Bloom's taxonomy,
AdEval conducts a multi-dimensional evaluation of LLMs across six cognitive
levels: remembering, understanding, applying, analyzing, evaluating, and
creating. Experimental results on multiple datasets demonstrate that AdEval
effectively reduces the impact of data contamination on evaluation outcomes,
enhancing both the fairness and reliability of the evaluation process.
comment: There are serious academic problems in this paper, such as data
falsification and plagiarism in the method of the paper
♻ ☆ Are All Spanish Doctors Male? Evaluating Gender Bias in German Machine Translation ISCA
We present WinoMTDE, a new gender bias evaluation test set designed to assess
occupational stereotyping and underrepresentation in German machine translation
(MT) systems. Building on the automatic evaluation method introduced by
arXiv:1906.00591v1, we extend the approach to German, a language with
grammatical gender. The WinoMTDE dataset comprises 288 German sentences
balanced with regard to gender as well as stereotype, the latter annotated
using German labor statistics. We conduct a large-scale evaluation of five
widely used MT systems and a large language model. Our results reveal
persistent bias in most models, with the LLM outperforming traditional systems.
The dataset and evaluation code are publicly available under
https://github.com/michellekappl/mt_gender_german.
comment: ISCA/ITG Workshop on Diversity in Large Speech and Language Models
♻ ★ Learning diverse attacks on large language models for robust red-teaming and safety tuning ICLR 2025
Seanie Lee, Minsu Kim, Lynn Cherif, David Dobre, Juho Lee, Sung Ju Hwang, Kenji Kawaguchi, Gauthier Gidel, Yoshua Bengio, Nikolay Malkin, Moksh Jain
Red-teaming, or identifying prompts that elicit harmful responses, is a
critical step in ensuring the safe and responsible deployment of large language
models (LLMs). Developing effective protection against many modes of attack
prompts requires discovering diverse attacks. Automated red-teaming typically
uses reinforcement learning to fine-tune an attacker language model to generate
prompts that elicit undesirable responses from a target LLM, as measured, for
example, by an auxiliary toxicity classifier. We show that even with explicit
regularization to favor novelty and diversity, existing approaches suffer from
mode collapse or fail to generate effective attacks. As a flexible and
probabilistically principled alternative, we propose to use GFlowNet
fine-tuning, followed by a secondary smoothing phase, to train the attacker
model to generate diverse and effective attack prompts. We find that the
attacks generated by our method are effective against a wide range of target
LLMs, both with and without safety tuning, and transfer well between target
LLMs. Finally, we demonstrate that models safety-tuned using a dataset of
red-teaming prompts generated by our method are robust to attacks from other
RL-based red-teaming approaches.
comment: ICLR 2025
♻ ☆ Kanana: Compute-efficient Bilingual Language Models
Kanana LLM Team, Yunju Bak, Hojin Lee, Minho Ryu, Jiyeon Ham, Seungjae Jung, Daniel Wontae Nam, Taegyeong Eo, Donghun Lee, Doohae Jung, Boseop Kim, Nayeon Kim, Jaesun Park, Hyunho Kim, Hyunwoong Ko, Changmin Lee, Kyoung-Woon On, Seulye Baeg, Junrae Cho, Sunghee Jung, Jieun Kang, EungGyun Kim, Eunhwa Kim, Byeongil Ko, Daniel Lee, Minchul Lee, Miok Lee, Shinbok Lee, Gaeun Seo
We introduce Kanana, a series of bilingual language models that demonstrate
superior performance in Korean and competitive performance in English. The
computational cost of Kanana is significantly lower than that of
state-of-the-art models of similar size. The report details the techniques
employed during pre-training to achieve compute-efficient yet competitive
models, including high quality data filtering, staged pre-training, depth
up-scaling, and pruning and distillation. Furthermore, the report outlines the
methodologies utilized during the post-training of the Kanana models,
encompassing supervised fine-tuning and preference optimization, aimed at
enhancing their capability for seamless interaction with users. Lastly, the
report elaborates on plausible approaches used for language model adaptation to
specific scenarios, such as embedding, retrieval augmented generation, and
function calling. The Kanana model series spans from 2.1B to 32.5B parameters
with 2.1B models (base, instruct, embedding) publicly released to promote
research on Korean language models.
comment: 40 pages, 15 figures
♻ ☆ Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization ICLR 2025
Superalignment, where humans act as weak supervisors for superhuman models,
has become a crucial problem with the rapid development of Large Language
Models (LLMs). Recent work has preliminarily studied this problem by using weak
models to supervise strong models, and discovered that weakly supervised strong
students can consistently outperform weak teachers towards the alignment
target, leading to a weak-to-strong generalization phenomenon. However, we are
concerned that, behind such a promising phenomenon, there may exist an issue of
weak-to-strong deception, where strong models deceive weak models by exhibiting
well-aligned behavior in areas known to weak models but producing misaligned
behaviors in cases weak models do not know about. We take an initial step towards
exploring this security issue in a specific but realistic multi-objective
alignment case, where there may be some alignment targets conflicting with each
other (e.g., helpfulness vs. harmlessness). We aim to explore whether, in such
cases, strong models might deliberately make mistakes in areas known to them
but unknown to weak models within one alignment dimension, in exchange for a
higher reward in another dimension. Through extensive experiments in both the
reward modeling and preference optimization scenarios, we find: (1) The
weak-to-strong deception phenomenon exists across all settings. (2) The
deception intensifies as the capability gap between weak and strong models
increases. (3) Bootstrapping with an intermediate model can mitigate the
deception to some extent, though its effectiveness remains limited. Our work
highlights the urgent need to pay more attention to the true reliability of
superalignment.
comment: Accepted at ICLR 2025, camera-ready version
♻ ☆ Pragmatic Reasoning improves LLM Code Generation
Large Language Models (LLMs) have demonstrated impressive potential in
translating natural language (NL) instructions into program code. However, user
instructions often contain inherent ambiguities, making it challenging for LLMs
to generate code that accurately reflects the user's true intent. To address
this challenge, researchers have proposed to produce multiple candidates of the
program code and then rerank them to identify the best solution. In this paper,
we propose CodeRSA, a novel code candidate reranking mechanism built upon the
Rational Speech Act (RSA) framework, designed to guide LLMs toward more
comprehensive pragmatic reasoning about user intent. We evaluate CodeRSA using
one of the latest LLMs on a popular code generation dataset. Our experiment
results show that CodeRSA consistently outperforms common baselines, surpasses
the state-of-the-art approach in most cases, and demonstrates robust overall
performance. These findings underscore the effectiveness of integrating
pragmatic reasoning into code candidate reranking, offering a promising
direction for enhancing code generation quality in LLMs.
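The Rational Speech Act reranking idea can be made concrete with a tiny numeric
example: starting from literal scores P(candidate satisfies interpretation)
that an LLM might assign, a pragmatic speaker normalizes over candidates, a
pragmatic listener renormalizes over interpretations, and candidates are
reranked by the listener's score for the intended reading. The numbers are
invented for illustration and are not from the paper.

    import numpy as np

    # literal[i, c]: estimated probability that candidate c satisfies
    # interpretation i of an ambiguous instruction (toy values).
    literal = np.array([[0.9, 0.8, 0.2],    # interpretation 0 (intended)
                        [0.9, 0.1, 0.7]])   # interpretation 1 (alternative)

    speaker = literal / literal.sum(axis=1, keepdims=True)   # S1(c | i)
    listener = speaker / speaker.sum(axis=0, keepdims=True)  # L1(i | c)

    intended = 0
    print("pragmatic scores:", listener[intended].round(3))
    print("reranked candidates:", np.argsort(-listener[intended]))
    # Candidate 1 now outranks candidate 0: it is nearly as good literally but
    # far less consistent with the alternative reading of the instruction.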
♻ ☆ ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation ICLR 2025
Cheng Yang, Chufan Shi, Yaxin Liu, Bo Shui, Junjie Wang, Mohan Jing, Linran Xu, Xinyu Zhu, Siheng Li, Yuxiang Zhang, Gongye Liu, Xiaomei Nie, Deng Cai, Yujiu Yang
We introduce a new benchmark, ChartMimic, aimed at assessing the
visually-grounded code generation capabilities of large multimodal models
(LMMs). ChartMimic utilizes information-intensive visual charts and textual
instructions as inputs, requiring LMMs to generate the corresponding code for
chart rendering. ChartMimic includes 4,800 human-curated (figure, instruction,
code) triplets, which represent the authentic chart use cases found in
scientific papers across various domains (e.g., Physics, Computer Science,
Economics, etc). These charts span 18 regular types and 4 advanced types,
diversifying into 201 subcategories. Furthermore, we propose multi-level
evaluation metrics to provide an automatic and thorough assessment of the
output code and the rendered charts. Unlike existing code generation
benchmarks, ChartMimic places emphasis on evaluating LMMs' capacity to
harmonize a blend of cognitive capabilities, encompassing visual understanding,
code generation, and cross-modal reasoning. The evaluation of 3 proprietary
models and 14 open-weight models highlights the substantial challenges posed by
ChartMimic. Even the advanced GPT-4o and InternVL2-Llama3-76B achieved average
scores across the Direct Mimic and Customized Mimic tasks of only 82.2 and 61.6,
respectively, indicating significant room for improvement. We anticipate that
ChartMimic will inspire the development of LMMs, advancing the pursuit of
artificial general intelligence.
comment: Accepted to ICLR 2025. Data and code are available at
https://github.com/ChartMimic/ChartMimic
♻ ☆ SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language Models
Large language models (LLMs) have been widely adopted due to their remarkable
performance across various applications, driving the accelerated development of
a large number of diverse models. However, these individual LLMs show
limitations in generalization and performance on complex tasks due to inherent
training biases, model size constraints, and the quality or diversity of
pre-training datasets. A promising direction is to efficiently harness the
diverse capabilities of LLMs to overcome these individual limitations. To this
end, we introduce a novel LLM selection algorithm called SelectLLM, which
efficiently directs input queries to the most suitable subset of LLMs from a
large pool, ensuring that the selected models collectively provide accurate
responses. SelectLLM employs a multi-label classifier, together with a policy
based on the classifier's predictions and confidence scores, to select an
optimal, query-aware, and lightweight subset of LLMs. Our findings indicate
that the proposed model outperforms existing ensemble-based baselines and
achieves competitive performance with similarly sized top-performing LLMs while
maintaining efficiency. Specifically, it achieves a huge reduction in inference
latency on two challenging reasoning benchmarks: 13% on GSM8K and 70% on MMLU,
compared to the top-performing baseline. Also, we establish a theoretical upper
bound by an Oracle with LLMs and perform an in-depth linguistic analysis to
understand the performance gap between the Oracle and SelectLLM.
comment: 8 pages
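A minimal sketch of the query-aware selection policy described above: a
multi-label classifier maps a query embedding to per-model probabilities of
answering correctly, and a simple confidence-thresholded policy picks a small
subset. The architecture, threshold, and subset cap are illustrative
assumptions rather than the paper's exact design.

    import torch
    import torch.nn as nn

    class LLMSelector(nn.Module):
        # Query embedding -> per-LLM probability of answering correctly.
        def __init__(self, embed_dim=768, n_llms=10):
            super().__init__()
            self.head = nn.Sequential(nn.Linear(embed_dim, 256), nn.ReLU(),
                                      nn.Linear(256, n_llms))

        def forward(self, query_emb):
            return torch.sigmoid(self.head(query_emb))

    def select_subset(probs, max_models=3, min_conf=0.5):
        # Take the most confident models above a threshold, capped in size.
        order = torch.argsort(probs, descending=True)
        chosen = [i.item() for i in order[:max_models] if probs[i] >= min_conf]
        return chosen or [order[0].item()]   # always query at least one model

    probs = LLMSelector()(torch.randn(768))
    print(select_subset(probs))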
♻ ☆ Training-Free Exponential Context Extension via Cascading KV Cache
The transformer's context window is vital for tasks such as few-shot learning
and conditional generation as it preserves previous tokens for active memory.
However, as the context lengths increase, the computational costs grow
quadratically, hindering the deployment of large language models (LLMs) in
real-world, long sequence scenarios. Although some recent key-value caching (KV
Cache) methods offer linear inference complexity, they naively manage the
stored context, prematurely evicting tokens and losing valuable information.
Moreover, they lack an optimized prefill/prompt stage strategy, resulting in
higher latency than even quadratic attention for realistic context sizes. In
response, we introduce a novel mechanism that leverages cascading sub-cache
buffers to selectively retain the most relevant tokens, enabling the model to
maintain longer context histories without increasing the cache size. Our
approach outperforms linear caching baselines across key benchmarks, including
streaming perplexity, question answering, book summarization, and passkey
retrieval, where it retains better retrieval accuracy at 1M tokens, four
doublings beyond its 65K cache size. Additionally, our method reduces prefill
stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
These innovations not only enhance the computational efficiency of LLMs but
also pave the way for their effective deployment in resource-constrained
environments, enabling large-scale, real-time applications with significantly
reduced latency.
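The following is an illustrative data-structure sketch of cascading sub-cache
buffers in the spirit of the description above; the promotion schedule, the
importance scores, and the buffer sizes are assumptions, not the paper's exact
eviction policy:

  from collections import deque

  class CascadingKVCache:
      def __init__(self, num_levels=4, level_size=16384):
          self.levels = [deque(maxlen=level_size) for _ in range(num_levels)]
          self.step = 0

      def add(self, token_kv, importance):
          self.step += 1
          self._insert(0, (token_kv, importance))

      def _insert(self, level, item):
          if level >= len(self.levels):
              return                       # past the last cascade: token is dropped
          buf = self.levels[level]
          if len(buf) == buf.maxlen:
              evicted = buf.popleft()      # oldest entry at this level
              # keep the more important of (incoming, evicted) here; the other
              # may be promoted to the next, slower-updating level
              keep, overflow = (item, evicted) if item[1] >= evicted[1] else (evicted, item)
              buf.append(keep)
              if self.step % (2 ** (level + 1)) == 0:
                  self._insert(level + 1, overflow)
          else:
              buf.append(item)

      def context(self):
          # oldest retained tokens (deepest level) first, newest level last
          return [kv for level in reversed(self.levels) for kv, _ in level]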
♻ ☆ LLM2: Let Large Language Models Harness System 2 Reasoning NAACL 2025
Large language models (LLMs) have exhibited impressive capabilities across a
myriad of tasks, yet they occasionally yield undesirable outputs. We posit that
these limitations are rooted in the foundational autoregressive architecture of
LLMs, which inherently lacks mechanisms for differentiating between desirable
and undesirable results. Drawing inspiration from the dual-process theory of
human cognition, we introduce LLM2, a novel framework that combines an LLM
(System 1) with a process-based verifier (System 2). Within LLM2, the LLM is
responsible for generating plausible candidates, while the verifier provides
timely process-based feedback to distinguish desirable and undesirable outputs.
The verifier is trained with a pairwise comparison loss on synthetic
process-supervision data generated through our token quality exploration
strategy. Empirical results on mathematical reasoning benchmarks substantiate
the efficacy of LLM2, exemplified by an accuracy enhancement from 50.3 to 57.8
(+7.5) for Llama3-1B on GSM8K. Furthermore, when combined with
self-consistency, LLM2 achieves additional improvements, boosting major@20
accuracy from 56.2 to 70.2 (+14.0).
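A minimal sketch of a pairwise comparison objective for the process verifier,
assuming a verifier model that scores a reasoning prefix plus a candidate step
(the construction of the synthetic process-supervision data is omitted):

  import torch.nn.functional as F

  def pairwise_verifier_loss(verifier, prefix_ids, good_step_ids, bad_step_ids):
      score_good = verifier(prefix_ids, good_step_ids)   # higher = more desirable step
      score_bad = verifier(prefix_ids, bad_step_ids)
      # Bradley-Terry-style objective: prefer the desirable continuation
      return -F.logsigmoid(score_good - score_bad).mean()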
comment: Accepted to NAACL 2025 Main Conference
♻ ☆ Behind the Tip of Efficiency: Uncovering the Submerged Threats of Jailbreak Attacks in Small Language Models
Small language models (SLMs) have become increasingly prominent in the
deployment on edge devices due to their high efficiency and low computational
cost. While researchers continue to advance the capabilities of SLMs through
innovative training strategies and model compression techniques, the security
risks of SLMs have received considerably less attention compared to large
language models (LLMs). To fill this gap, we provide a comprehensive empirical
study to evaluate the security performance of 13 state-of-the-art SLMs under
various jailbreak attacks. Our experiments demonstrate that most SLMs are quite
susceptible to existing jailbreak attacks, while some of them are even
vulnerable to direct harmful prompts. To address these safety concerns, we
evaluate several representative defense methods and demonstrate their
effectiveness in enhancing the security of SLMs. We further analyze the
potential security degradation caused by different SLM techniques including
architecture compression, quantization, knowledge distillation, and so on. We
expect that our research can highlight the security challenges of SLMs and
provide valuable insights to future work in developing more robust and secure
SLMs.
comment: 12 pages, 6 figures
♻ ☆ MiCEval: Unveiling Multimodal Chain of Thought's Quality via Image Description and Reasoning Steps NAACL 2025
Xiongtao Zhou, Jie He, Lanyu Chen, Jingyu Li, Haojing Chen, Víctor Gutiérrez-Basulto, Jeff Z. Pan, Hanjie Chen
Multimodal Chain of Thought (MCoT) is a popular prompting strategy for
improving the performance of multimodal large language models (MLLMs) across a
range of complex reasoning tasks. Despite its popularity, there is a notable
absence of automated methods for evaluating the quality of reasoning steps in
MCoT. To address this gap, we propose Multimodal Chain-of-Thought Evaluation
(MiCEval), a framework designed to assess the correctness of reasoning chains
by evaluating the quality of both the description and each reasoning step. The
evaluation of the description component focuses on the accuracy of the image
descriptions, while the evaluation of each reasoning step assesses its quality
conditioned on the preceding steps. MiCEval is built upon
a fine-grained dataset with annotations that rate each step according to
correctness, relevance, and informativeness. Extensive experiments on four
state-of-the-art MLLMs show that step-wise evaluations using MiCEval align more
closely with human judgments compared to existing methods based on cosine
similarity or fine-tuning approaches. MiCEval datasets and code can be found at
https://github.com/alenai97/MiCEval.
comment: NAACL 2025
♻ ☆ DropBP: Accelerating Fine-Tuning of Large Language Models by Dropping Backward Propagation
Large language models (LLMs) have achieved significant success across various
domains. However, training these LLMs typically involves substantial memory and
computational costs during both forward and backward propagation. While
parameter-efficient fine-tuning (PEFT) considerably reduces the training memory
associated with parameters, it does not address the significant computational
costs and activation memory. In this paper, we propose Dropping Backward
Propagation (DropBP), a novel approach designed to reduce computational costs
and activation memory while maintaining accuracy. DropBP randomly drops layers
during backward propagation, which is essentially equivalent to training
shallow submodules generated by undropped layers and residual connections.
Additionally, DropBP calculates the sensitivity of each layer to assign an
appropriate drop rate, thereby stabilizing the training process. DropBP is not
only applicable to full fine-tuning but can also be orthogonally integrated
with all types of PEFT by dropping layers during backward propagation.
Specifically, DropBP can reduce training time by 44% with comparable accuracy
to the baseline, accelerate convergence to the same perplexity by 1.5x, and
enable training with a 6.2x longer sequence length on a single NVIDIA A100 GPU.
Furthermore, DropBP enables a throughput increase of 79% on an NVIDIA A100
GPU and 117% on an Intel Gaudi2 HPU. The code is available at
https://github.com/WooSunghyeon/dropbp.
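A conceptual sketch of dropping backward propagation through a residual block
(not the released implementation; the sensitivity-based per-layer drop rates
are omitted): gradients are blocked through the block with some probability,
while the forward pass is unchanged.

  import torch
  import torch.nn as nn

  class BackwardDroppedBlock(nn.Module):
      def __init__(self, block, drop_prob=0.5):
          super().__init__()
          self.block = block
          self.drop_prob = drop_prob

      def forward(self, x):
          out = self.block(x)
          if self.training and torch.rand(()) < self.drop_prob:
              out = out.detach()         # no gradient flows through this block
          return x + out                 # the residual path is always kept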
♻ ☆ ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities
Ezra Karger, Houtan Bastani, Chen Yueh-Han, Zachary Jacobs, Danny Halawi, Fred Zhang, Philip E. Tetlock
Forecasts of future events are essential inputs into informed
decision-making. Machine learning (ML) systems have the potential to deliver
forecasts at scale, but there is no framework for evaluating the accuracy of ML
systems on a standardized set of forecasting questions. To address this gap, we
introduce ForecastBench: a dynamic benchmark that evaluates the accuracy of ML
systems on an automatically generated and regularly updated set of 1,000
forecasting questions. To avoid any possibility of data leakage, ForecastBench
is composed solely of questions about future events that have no known answer
at the time of submission. We quantify the capabilities of current ML systems
by collecting forecasts from expert (human) forecasters, the general public,
and LLMs on a random subset of questions from the benchmark ($N=200$). While
LLMs have achieved super-human performance on many benchmarks, they perform
less well here: expert forecasters outperform the top-performing LLM ($p$-value
$<0.001$). We display system and human scores in a public leaderboard at
www.forecastbench.org.
♻ ☆ Explore the Reasoning Capability of LLMs in the Chess Testbed NAACL2025
Reasoning is a central capability of human intelligence. In recent years,
with the advent of large-scale datasets, pretrained large language models have
emerged with new capabilities, including reasoning. However, these models still
struggle with long-term, complex reasoning tasks, such as playing chess. Based
on the observation that expert chess players employ a dual approach combining
long-term strategic play with short-term tactical play along with language
explanation, we propose improving the reasoning capability of large language
models in chess by integrating annotated strategies and tactics. Specifically, we
collect a dataset named MATE, which consists of 1 million chess positions with
candidate moves annotated by chess experts for strategy and tactics. We
finetune the LLaMA-3-8B model and compare it against state-of-the-art
commercial language models in the task of selecting better chess moves. Our
experiments show that our models perform better than GPT, Claude, and Gemini
models. We find that language explanations can enhance the reasoning capability
of large language models.
comment: NAACL2025 Main Conference. Data and models are available:
https://mate-chess.github.io/
♻ ☆ Eliciting In-context Retrieval and Reasoning for Long-context Large Language Models
Recent advancements in long-context language models (LCLMs) promise to
transform Retrieval-Augmented Generation (RAG) by simplifying pipelines. With
their expanded context windows, LCLMs can process entire knowledge bases and
perform retrieval and reasoning directly -- a capability we define as
In-Context Retrieval and Reasoning (ICR^2). However, existing benchmarks like
LOFT often overestimate LCLM performance by providing overly simplified
contexts. To address this, we introduce ICR^2, a benchmark that evaluates LCLMs
in more realistic scenarios by including confounding passages retrieved with
strong retrievers. We then propose three methods to enhance LCLM performance:
(1) retrieve-then-generate fine-tuning, (2) retrieval-attention-probing, which
uses attention heads to filter and de-noise long contexts during decoding, and
(3) joint retrieval head training alongside the generation head. Our evaluation
of five well-known LCLMs on LOFT and ICR^2 demonstrates significant gains with
our best approach applied to Mistral-7B: +17 and +15 points by Exact Match on
LOFT, and +13 and +2 points on ICR^2, compared to vanilla RAG and supervised
fine-tuning, respectively. It even outperforms GPT-4-Turbo on most tasks
despite being a much smaller model.
♻ ☆ Do as I do (Safely): Mitigating Task-Specific Fine-tuning Risks in Large Language Models ICLR'25
Recent research shows that fine-tuning on benign instruction-following data
can inadvertently undo the safety alignment process and increase a model's
propensity to comply with harmful queries. While instruction-following
fine-tuning is important, task-specific fine-tuning - where models are trained
on datasets with clear ground truth answers (e.g., multiple choice questions) -
can enhance model performance on specialized downstream tasks. Understanding
and mitigating safety risks in the task-specific setting remains distinct from
the instruction-following context due to structural differences in the data.
Our work demonstrates how malicious actors can subtly manipulate the structure
of almost any task-specific dataset to foster significantly more dangerous
model behaviors, while maintaining an appearance of innocuity and reasonable
downstream task performance. To address this issue, we propose a novel
mitigation strategy that mixes in safety data which mimics the task format and
prompting style of the user data, showing this is significantly more effective
and efficient than existing baselines at re-establishing safety alignment while
maintaining similar task performance.
comment: Accepted to ICLR'25
♻ ☆ Learning Efficient Recursive Numeral Systems via Reinforcement Learning
It has previously been shown that by using reinforcement learning (RL),
agents can derive simple approximate and exact-restricted numeral systems that
are similar to human ones (Carlsson, 2021). However, it is a major challenge to
show how more complex recursive numeral systems, similar to, for example,
English, could arise via a simple learning mechanism such as RL. Here, we
introduce an approach towards deriving a mechanistic explanation of the
emergence of efficient recursive number systems. We consider pairs of agents
learning how to communicate about numerical quantities through a meta-grammar
that can be gradually modified throughout the interactions. We find that the
seminal meta-grammar of Hurford (Hurford, 1975) is not suitable for this
application as its optimization results in systems that deviate from standard
conventions observed within human numeral systems. We propose a simple
modification which addresses this issue. Utilising a slightly modified version
of the meta-grammar of Hurford, we demonstrate that our RL agents, shaped by
the pressures for efficient communication, can effectively modify their lexicon
towards Pareto-optimal configurations which are comparable to those observed
within human numeral systems in terms of their efficiency.
comment: 8 pages, 5 figures
♻ ☆ Alleviating Distribution Shift in Synthetic Data for Machine Translation Quality Estimation
Quality Estimation (QE) models evaluate the quality of machine translations
without reference translations, serving as the reward models for the
translation task. Due to the data scarcity, synthetic data generation has
emerged as a promising solution. However, synthetic QE data often suffers from
distribution shift, which can manifest as discrepancies between pseudo and real
translations, or in pseudo labels that do not align with human preferences. To
tackle this issue, we introduce ADSQE, a novel framework for alleviating
distribution shift in synthetic QE data. To reduce the difference between
pseudo and real translations, we employ the constrained beam search algorithm
and enhance translation diversity through the use of distinct generation
models. ADSQE uses references, i.e., translation supervision signals, to guide
both the generation and annotation processes, enhancing the quality of
word-level labels. ADSQE further identifies the shortest phrase covering
consecutive error tokens, mimicking human annotation behavior, to assign the
final phrase-level labels. Notably, we underscore that a translation model
cannot accurately annotate its own translations. Extensive experiments
demonstrate that ADSQE outperforms SOTA baselines like COMET in both supervised
and unsupervised settings. Further analysis offers insights into synthetic data
generation that could benefit reward models for other tasks.
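A hypothetical sketch of the phrase-level labeling step described above: runs
of consecutive error tokens are grouped, and the shortest candidate phrase
covering each run is selected. The phrase spans are assumed to come from any
external chunker, not from the authors' pipeline.

  def phrase_level_labels(error_flags, phrases):
      spans, start = [], None
      for i, is_err in enumerate(error_flags + [False]):   # sentinel closes the last run
          if is_err and start is None:
              start = i
          elif not is_err and start is not None:
              spans.append((start, i - 1))
              start = None
      labels = []
      for run_start, run_end in spans:
          covering = [p for p in phrases if p[0] <= run_start and p[1] >= run_end]
          if covering:
              labels.append(min(covering, key=lambda p: p[1] - p[0]))   # shortest cover
      return labels

  # phrase_level_labels([0, 1, 1, 0, 1], [(0, 2), (1, 2), (3, 4), (4, 4)])
  # -> [(1, 2), (4, 4)]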
♻ ☆ Small Models are LLM Knowledge Triggers on Medical Tabular Prediction ICLR 2025
Recent development in large language models (LLMs) has demonstrated
impressive domain proficiency on unstructured textual or multi-modal tasks.
However, despite their intrinsic world knowledge, their application to
structured tabular data prediction still lags behind, primarily due to
numerical insensitivity and modality discrepancy, which create a gap between
LLM reasoning and statistical tabular learning. Unlike textual or vision data
(e.g., electronic clinical notes or medical imaging data), tabular data is
often presented in heterogeneous numerical values (e.g., CBC reports). This
ubiquitous data format requires intensive expert annotation, and its numerical
nature limits LLMs' capability to effectively transfer untapped domain
expertise. In this paper, we propose SERSAL, a general self-prompting method by
synergy learning with small models to enhance LLM tabular prediction in an
unsupervised manner. Specifically, SERSAL utilizes the LLM's prior outcomes as
original soft noisy annotations, which are dynamically leveraged to teach a
better small student model. In turn, the outcomes from the trained small
model are used to teach the LLM to further refine its real capability. This
process can be repeatedly applied to gradually distill refined knowledge for
continuous progress. Comprehensive experiments on widely used medical domain
tabular datasets show that, without access to gold labels, applying SERSAL to
OpenAI GPT's reasoning process attains substantial improvements compared to
linguistic prompting methods, which serves as an orthogonal direction for
tabular LLMs, and the prompting gains increase as more powerful LLMs appear.
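A high-level sketch of the synergy loop as described in the abstract; the LLM
and small-model interfaces (predict_proba, fit, teach) are assumed placeholders,
not the authors' code:

  def sersal_loop(llm, small_model_cls, unlabeled_rows, num_rounds=3):
      # the LLM's zero-shot outputs serve as noisy soft annotations
      soft_labels = [llm.predict_proba(row) for row in unlabeled_rows]
      for _ in range(num_rounds):
          student = small_model_cls()
          student.fit(unlabeled_rows, soft_labels)     # small model denoises the labels
          refined = [student.predict_proba(row) for row in unlabeled_rows]
          # refined outcomes are fed back to teach the LLM (e.g., fine-tuning or
          # in-context examples), then the loop repeats with the updated LLM
          llm = llm.teach(unlabeled_rows, refined)
          soft_labels = [llm.predict_proba(row) for row in unlabeled_rows]
      return llm, student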
comment: Accepted to ICLR 2025. Codes will be available at
https://github.com/jyansir/sersal
♻ ☆ Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues ICLR
Linear Recurrent Neural Networks (LRNNs) such as Mamba, RWKV, GLA, mLSTM, and
DeltaNet have emerged as efficient alternatives to Transformers for long
sequences. However, both Transformers and LRNNs struggle to perform
state-tracking, which may impair performance in tasks such as code evaluation.
In one forward pass, current architectures are unable to solve even parity, the
simplest state-tracking task, which non-linear RNNs can handle effectively.
Recently, Sarrof et al. (2024) demonstrated that the failure of LRNNs like
Mamba to solve parity stems from restricting the value range of their diagonal
state-transition matrices to $[0, 1]$ and that incorporating negative values
can resolve this issue. We extend this result to non-diagonal LRNNs such as
DeltaNet. We prove that finite precision LRNNs with state-transition matrices
having only positive eigenvalues cannot solve parity, while non-triangular
matrices are needed to count modulo $3$. Notably, we also prove that LRNNs can
learn any regular language when their state-transition matrices are products of
identity minus vector outer product matrices, each with eigenvalues in the
range $[-1, 1]$. Our experiments confirm that extending the eigenvalue range of
Mamba and DeltaNet to include negative values not only enables them to solve
parity but consistently improves their performance on state-tracking tasks. We
also show that state-tracking enabled LRNNs can be pretrained stably and
efficiently at scale (1.3B parameters), achieving competitive performance on
language modeling and showing promise on code and math tasks.
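A tiny worked example of the parity argument above: a one-dimensional diagonal
linear recurrence computes parity once the transition value is allowed to be
-1, which is impossible when it is restricted to [0, 1].

  def parity_via_linear_rnn(bits):
      h = 1.0
      for x in bits:
          a = -1.0 if x == 1 else 1.0   # transition value (eigenvalue) in [-1, 1]
          h = a * h                     # h_t = a(x_t) * h_{t-1}
      return 0 if h > 0 else 1          # the sign of the state encodes parity

  assert parity_via_linear_rnn([1, 0, 1, 1]) == 1   # three ones -> odd
  assert parity_via_linear_rnn([1, 1]) == 0         # two ones  -> even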
comment: V2: Correction to Theorem 1 and 2 and to point 3 of Proposition 1.
V3: ICLR Camera Ready
♻ ☆ ColPali: Efficient Document Retrieval with Vision Language Models ICLR 2025
Documents are visually rich structures that convey information through text,
but also figures, page layouts, tables, or even fonts. Since modern retrieval
systems mainly rely on the textual information they extract from document pages
to index documents, often through lengthy and brittle processes, they struggle
to exploit key visual cues efficiently. This limits their capabilities in many
practical document retrieval applications such as Retrieval Augmented
Generation (RAG). To benchmark current systems on visually rich document
retrieval, we introduce the Visual Document Retrieval Benchmark ViDoRe,
composed of various page-level retrieval tasks spanning multiple domains,
languages, and practical settings. The inherent complexity and performance
shortcomings of modern systems motivate a new concept: performing document retrieval
by directly embedding the images of the document pages. We release ColPali, a
Vision Language Model trained to produce high-quality multi-vector embeddings
from images of document pages. Combined with a late interaction matching
mechanism, ColPali largely outperforms modern document retrieval pipelines
while being drastically simpler, faster and end-to-end trainable. We release
models, data, code and benchmarks under open licenses at https://hf.co/vidore.
comment: Published as a conference paper at ICLR 2025
♻ ☆ Measuring Data Diversity for Instruction Tuning: A Systematic Analysis and A Reliable Metric
Yuming Yang, Yang Nan, Junjie Ye, Shihan Dou, Xiao Wang, Shuo Li, Huijie Lv, Mingqi Wu, Tao Gui, Qi Zhang, Xuanjing Huang
Data diversity is crucial for the instruction tuning of large language
models. Existing studies have explored various diversity-aware data selection
methods to construct high-quality datasets and enhance model performance.
However, the fundamental problem of precisely defining and measuring data
diversity remains underexplored, limiting clear guidance for data engineering.
To address this, we systematically analyze 11 existing diversity measurement
methods by evaluating their correlation with model performance through
extensive fine-tuning experiments. Our results indicate that a reliable
diversity measure should properly account for both inter-sample differences and
the information distribution in the sample space. Building on this, we propose
NovelSum, a new diversity metric based on sample-level "novelty." Experiments
on both simulated and real-world data show that NovelSum accurately captures
diversity variations and achieves a 0.97 correlation with instruction-tuned
model performance, highlighting its value in guiding data engineering
practices. With NovelSum as an optimization objective, we further develop a
greedy, diversity-oriented data selection strategy that outperforms existing
approaches, validating both the effectiveness and practical significance of our
metric.
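As an illustrative stand-in (not the exact NovelSum formula), the sketch below
treats a sample's novelty as its distance to the nearest already-selected
sample and performs greedy, diversity-oriented selection over embedding vectors:

  import numpy as np

  def novelty_sum(embeddings):
      # diversity of a fixed set: sum of each sample's nearest-neighbor distance
      d = np.linalg.norm(embeddings[:, None] - embeddings[None, :], axis=-1)
      np.fill_diagonal(d, np.inf)
      return float(d.min(axis=1).sum())

  def greedy_diverse_selection(embeddings, k):
      selected = [0]                                    # seed with an arbitrary sample
      dists = np.linalg.norm(embeddings - embeddings[0], axis=1)
      while len(selected) < k:
          nxt = int(np.argmax(dists))                   # currently most "novel" sample
          selected.append(nxt)
          dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
      return selected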
comment: 16 pages. The related codes and resources will be released later.
Project page: https://github.com/UmeanNever/NovelSum
♻ ☆ Energy-Based Diffusion Language Models for Text Generation
Minkai Xu, Tomas Geffner, Karsten Kreis, Weili Nie, Yilun Xu, Jure Leskovec, Stefano Ermon, Arash Vahdat
Despite remarkable progress in autoregressive language models, alternative
generative paradigms beyond left-to-right generation are still being actively
explored. Discrete diffusion models, with the capacity for parallel generation,
have recently emerged as a promising alternative. Unfortunately, these models
still underperform the autoregressive counterparts, with the performance gap
increasing when reducing the number of sampling steps. Our analysis reveals
that this degradation is a consequence of an imperfect approximation used by
diffusion models. In this work, we propose Energy-based Diffusion Language
Model (EDLM), an energy-based model operating at the full sequence level for
each diffusion step, introduced to improve the underlying approximation used by
diffusion models. More specifically, we introduce an EBM in a residual form,
and show that its parameters can be obtained by leveraging a pretrained
autoregressive model or by finetuning a bidirectional transformer via noise
contrastive estimation. We also propose an efficient generation algorithm via
parallel importance sampling. Comprehensive experiments on language modeling
benchmarks show that our model can consistently outperform state-of-the-art
diffusion models by a significant margin, and approaches autoregressive models'
perplexity. We further show that, without any generation performance drop, our
framework offers a 1.3$\times$ sampling speedup over existing diffusion models.
♻ ☆ Beyond Natural Language Perplexity: Detecting Dead Code Poisoning in Code Generation Datasets
The increasing adoption of large language models (LLMs) for code-related
tasks has raised concerns about the security of their training datasets. One
critical threat is dead code poisoning, where syntactically valid but
functionally redundant code is injected into training data to manipulate model
behavior. Such attacks can degrade the performance of neural code search
systems, leading to biased or insecure code suggestions. Existing detection
methods, such as token-level perplexity analysis, fail to effectively identify
dead code due to the structural and contextual characteristics of programming
languages. In this paper, we propose DePA (Dead Code Perplexity Analysis), a
novel line-level detection and cleansing method tailored to the structural
properties of code. DePA computes line-level perplexity by leveraging the
contextual relationships between code lines and identifies anomalous lines by
comparing their perplexity to the overall distribution within the file. Our
experiments on benchmark datasets demonstrate that DePA significantly
outperforms existing methods, achieving 0.14-0.19 improvement in detection
F1-score and a 44-65% increase in poisoned segment localization precision.
Furthermore, DePA enhances detection speed by 0.62-23x, making it practical for
large-scale dataset cleansing. Overall, by addressing the unique challenges of
dead code poisoning, DePA provides a robust and efficient solution for
safeguarding the integrity of code generation model training datasets.
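A rough sketch of line-level perplexity screening in this spirit, using an
off-the-shelf causal language model from Hugging Face Transformers; the model
choice (gpt2) and the z-score threshold are illustrative, not the paper's
configuration:

  import torch
  from transformers import AutoTokenizer, AutoModelForCausalLM

  tok = AutoTokenizer.from_pretrained("gpt2")
  model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

  def line_perplexities(code_lines):
      ppls = []
      for i, line in enumerate(code_lines):
          if not line.strip():
              ppls.append(0.0)                         # blank lines are not scored
              continue
          prefix = tok.bos_token + "\n".join(code_lines[:i] + [""])
          ctx_ids = tok(prefix, return_tensors="pt").input_ids
          line_ids = tok(line, return_tensors="pt").input_ids
          ids = torch.cat([ctx_ids, line_ids], dim=1)
          labels = ids.clone()
          labels[:, : ctx_ids.shape[1]] = -100         # score only the current line
          with torch.no_grad():
              loss = model(ids, labels=labels).loss    # mean NLL over the line's tokens
          ppls.append(float(torch.exp(loss)))
      return ppls

  def flag_anomalous_lines(code_lines, z_threshold=2.0):
      ppls = torch.tensor(line_perplexities(code_lines))
      z = (ppls - ppls.mean()) / (ppls.std() + 1e-8)
      return [ln for ln, score in zip(code_lines, z) if score > z_threshold]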
♻ ☆ Scaling Large-Language-Model-based Multi-Agent Collaboration ICLR-2025
Chen Qian, Zihao Xie, YiFei Wang, Wei Liu, Kunlun Zhu, Hanchen Xia, Yufan Dang, Zhuoyun Du, Weize Chen, Cheng Yang, Zhiyuan Liu, Maosong Sun
Recent breakthroughs in large language model-driven autonomous agents have
revealed that multi-agent collaboration often surpasses each individual agent
through collective reasoning. Inspired by the neural scaling law (increasing
neurons enhances performance), this study explores whether continuously adding
collaborative agents can yield similar benefits. Technically, we utilize
directed acyclic graphs to organize agents into a multi-agent collaboration
network (MacNet), upon which their interactive reasoning is topologically
orchestrated for autonomous task solving. Extensive evaluations reveal that it
effectively supports collaboration among over a thousand agents, with irregular
topologies outperforming regular ones. We also identify a collaborative scaling
law: the overall performance follows a logistic growth pattern as agents scale,
with collaborative emergence occurring earlier than traditional neural
emergence. We speculate this may be because scaling agents catalyzes their
multidimensional considerations during interactive reflection and refinement,
thereby producing more comprehensive artifacts. The code is available at
https://github.com/OpenBMB/ChatDev/tree/macnet.
comment: Accepted to ICLR-2025; https://github.com/OpenBMB/ChatDev/tree/macnet
♻ ☆ AutoBencher: Towards Declarative Benchmark Construction ICLR 2025
We present AutoBencher, a declarative framework for automatic benchmark
construction, and use it to scalably discover novel insights and
vulnerabilities of existing language models. Concretely, given a few desiderata
of benchmarks (e.g., question difficulty, topic salience), we operationalize
each desideratum and cast benchmark creation as an optimization problem.
Specifically, we experiment with two settings with different optimization
objectives: (i) for capability evaluation, we declare the goal of finding a
salient, difficult dataset that induces novel performance patterns; (ii) for
safety evaluation, we declare the goal of finding a dataset of unsafe prompts
that existing LMs fail to decline. To tackle this optimization problem, we use
a language model to iteratively propose and refine dataset descriptions, which
are then used to generate topic-specific questions and answers. These
descriptions are optimized to improve the declared desiderata. We use
AutoBencher (powered by GPT-4) to create datasets for math, multilinguality,
knowledge, and safety. The scalability of AutoBencher allows it to test
fine-grained categories and tail knowledge, creating datasets that elicit 22%
more model errors (i.e., difficulty) than existing benchmarks. On the novelty
end, AutoBencher also helps identify specific gaps not captured by existing
benchmarks: e.g., Gemini-Pro has knowledge gaps on Permian Extinction and
Fordism while GPT-4o fails to decline harmful requests about cryptocurrency
scams.
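A schematic sketch of the declarative optimization loop described above; every
interface here (the proposer LM, question generation, and the desiderata
scoring function) is a placeholder assumption rather than the released system:

  def autobench_loop(proposer_lm, target_models, desiderata_score, rounds=5):
      best_desc, best_score = None, float("-inf")
      descriptions = proposer_lm.propose_initial_descriptions()
      for _ in range(rounds):
          scored = []
          for desc in descriptions:
              qa_pairs = proposer_lm.generate_questions(desc)      # topic-specific Q&A
              results = {m.name: m.answer_all(qa_pairs) for m in target_models}
              scored.append((desc, desiderata_score(qa_pairs, results)))
          desc, score = max(scored, key=lambda pair: pair[1])
          if score > best_score:
              best_desc, best_score = desc, score
          descriptions = proposer_lm.refine_descriptions(scored)   # propose next round
      return best_desc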
comment: Accepted for publication at ICLR 2025
♻ ☆ Self-Training Elicits Concise Reasoning in Large Language Models
Chain-of-thought (CoT) reasoning has enabled large language models (LLMs) to
utilize additional computation through intermediate tokens to solve complex
tasks. However, we posit that typical reasoning traces contain many redundant
tokens, incurring extraneous inference costs. Upon examination of the output
distribution of current LLMs, we find evidence on their latent ability to
reason more concisely, relative to their default behavior. To elicit this
capability, we propose simple fine-tuning methods which leverage self-generated
concise reasoning paths obtained by best-of-N sampling and few-shot
conditioning, in task-specific settings. Our combined method achieves a 30%
reduction in output tokens on average, across five model families on GSM8K and
MATH, while maintaining average accuracy. By exploiting the fundamental
stochasticity and in-context learning capabilities of LLMs, our self-training
approach robustly elicits concise reasoning on a wide range of models,
including those with extensive post-training. Code is available at
https://github.com/TergelMunkhbat/concise-reasoning
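A minimal sketch of the data-construction step implied above: sample several
reasoning traces per problem and keep the shortest one that reaches the correct
answer. The generation and answer-extraction interfaces are assumptions, not
the authors' code.

  def build_concise_sft_data(model, problems, extract_answer, n_samples=8):
      sft_pairs = []
      for prob in problems:
          traces = [model.generate(prob["question"], temperature=1.0)
                    for _ in range(n_samples)]
          correct = [t for t in traces if extract_answer(t) == prob["answer"]]
          if correct:
              shortest = min(correct, key=len)         # concise yet still correct
              sft_pairs.append((prob["question"], shortest))
      return sft_pairs

  # fine-tune on the concise traces, e.g. model.finetune(build_concise_sft_data(...))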
comment: 23 pages, 10 figures, 18 tables
♻ ☆ A non-ergodic framework for understanding emergent capabilities in Large Language Models
Large language models have emergent capabilities that come unexpectedly at
scale, but we need a theoretical framework to explain why and how they emerge.
We prove that language models are actually non-ergodic systems while providing
a mathematical framework based on Stuart Kauffman's theory of the adjacent
possible (TAP) to explain capability emergence. Our resource-constrained TAP
equation demonstrates how architectural, training, and contextual constraints
interact to shape model capabilities through phase transitions in semantic
space. We show through experiments with three different language models that
capabilities emerge through discrete transitions guided by constraint
interactions and path-dependent exploration. This framework provides a
theoretical basis for understanding emergence in language models and informs
the development of architectures that can steer capability emergence.
♻ ☆ Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel
Zun Wang, Jialu Li, Yicong Hong, Songze Li, Kunchang Li, Shoubin Yu, Yi Wang, Yu Qiao, Yali Wang, Mohit Bansal, Limin Wang
Creating high-quality data for training robust language-instructed agents is
a long-lasting challenge in embodied AI. In this paper, we introduce a
Self-Refining Data Flywheel (SRDF) that generates high-quality and large-scale
navigational instruction-trajectory pairs by iteratively refining the data pool
through the collaboration between two models, the instruction generator and the
navigator, without any human-in-the-loop annotation. Specifically, SRDF starts
with using a base generator to create an initial data pool for training a base
navigator, followed by applying the trained navigator to filter the data pool.
This leads to higher-fidelity data to train a better generator, which can, in
turn, produce higher-quality data for training the next-round navigator. Such a
flywheel establishes a data self-refining process, yielding a continuously
improved and highly effective dataset for large-scale language-guided
navigation learning. Our experiments demonstrate that after several flywheel
rounds, the navigator elevates the performance boundary from 70% to 78% SPL on
the classic R2R test set, surpassing human performance (76%) for the first
time. Meanwhile, this process results in a superior generator, evidenced by a
SPICE increase from 23.5 to 26.2, better than all previous VLN instruction
generation methods. Finally, we demonstrate the scalability of our method
through increasing environment and instruction diversity, and the
generalization ability of our pre-trained navigator across various downstream
navigation tasks, surpassing state-of-the-art methods by a large margin in all
cases.
comment: 28 pages, Code and data are available at
https://github.com/wz0919/VLN-SRDF
♻ ☆ ChroKnowledge: Unveiling Chronological Knowledge of Language Models in Multiple Domains ICLR 2025
Large language models (LLMs) have brought significant changes to many aspects
of our lives. However, assessing and ensuring their chronological knowledge
remains challenging. Existing approaches fall short in addressing the temporal
adaptability of knowledge, often relying on a fixed time-point view. To
overcome this, we introduce ChroKnowBench, a benchmark dataset designed to
evaluate chronologically accumulated knowledge across three key aspects:
multiple domains, time dependency, and temporal state. Our benchmark
distinguishes between knowledge that evolves (e.g., personal history,
scientific discoveries, amended laws) and knowledge that remains constant
(e.g., mathematical truths,
commonsense facts). Building on this benchmark, we present ChroKnowledge
(Chronological Categorization of Knowledge), a novel sampling-based framework
for evaluating LLMs' non-parametric chronological knowledge. Our evaluation led
to the following observations: (1) The ability to elicit temporal knowledge
varies depending on the data format the model was trained on. (2) LLMs
partially recall knowledge or show a cut-off at temporal boundaries rather than
recalling all aspects of knowledge correctly. Thus, we apply our
ChroKnowPrompt, an in-depth prompting strategy that elicits chronological knowledge by
traversing step-by-step through the surrounding time spans. We observe that it
successfully recalls objects across both open-source and proprietary LLMs,
demonstrating versatility, though it faces challenges with dynamic datasets and
unstructured formats.
comment: ICLR 2025, 40 pages, 17 figures
♻ ☆ MIRAGE: Evaluating and Explaining Inductive Reasoning Process in Language Models ICLR 2025
Inductive reasoning is an essential capability for large language models
(LLMs) to achieve higher intelligence, which requires the model to generalize
rules from observed facts and then apply them to unseen examples. We present
MIRAGE, a synthetic dataset that addresses the limitations of previous work,
specifically the lack of comprehensive evaluation and flexible test data. In
it, we evaluate LLMs' capabilities in both the inductive and deductive stages,
allowing for flexible variation in input distribution, task scenario, and task
difficulty to analyze the factors influencing LLMs' inductive reasoning. Based
on these multi-faceted evaluations, we demonstrate that LLMs are poor
rule-based reasoners. In many cases, when conducting inductive reasoning, they
do not rely on a correct rule to answer the unseen case. From the perspectives
of different prompting methods, observation numbers, and task forms, models
tend to consistently conduct correct deduction without correct inductive rules.
We also find that LLMs are good neighbor-based reasoners. In the inductive
reasoning process, the model tends to focus on observed facts that are close to
the current test example in feature space. By leveraging these similar
examples, the model maintains strong inductive capabilities within a localized
region, significantly improving its deductive performance.
comment: Accepted as ICLR 2025 conference paper (26 pages, 16 tables, 9
figures)
♻ ☆ PediaBench: A Comprehensive Chinese Pediatric Dataset for Benchmarking Large Language Models
Qian Zhang, Panfeng Chen, Jiali Li, Linkun Feng, Shuyu Liu, Heng Zhao, Mei Chen, Hui Li, Yanhao Wang
The emergence of Large Language Models (LLMs) in the medical domain has
underscored a compelling need for standard datasets to evaluate their
question-answering (QA) performance. Although there have been several benchmark
datasets for medical QA, they either cover common knowledge across different
departments or are specific to a department other than pediatrics.
Moreover, some of them are limited to objective questions and do not measure
the generation capacity of LLMs. Therefore, they cannot comprehensively assess
the QA ability of LLMs in pediatrics. To fill this gap, we construct
PediaBench, the first Chinese pediatric dataset for LLM evaluation.
Specifically, it contains 4,117 objective questions and 1,632 subjective
questions spanning 12 pediatric disease groups. It adopts an integrated scoring
criterion based on different difficulty levels to thoroughly assess the
proficiency of an LLM in instruction following, knowledge understanding,
clinical case analysis, etc. Finally, we validate the effectiveness of
PediaBench with extensive experiments on 20 open-source and commercial LLMs.
Through an in-depth analysis of experimental results, we offer insights into
the ability of LLMs to answer pediatric questions in the Chinese context,
highlighting their limitations for further improvements. Our code and data are
published at https://github.com/ACMISLab/PediaBench.
comment: 21 pages, 12 figures
♻ ☆ LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion
Zhan Ling, Kang Liu, Kai Yan, Yifan Yang, Weijian Lin, Ting-Han Fan, Lingfeng Shen, Zhengyin Du, Jiecao Chen
Large language models (LLMs) have demonstrated remarkable progress in
understanding long-context inputs. However, benchmarks for evaluating the
long-context reasoning abilities of LLMs have not kept pace. Existing
benchmarks often focus on a narrow range of tasks or those that do not demand
complex reasoning. To address this gap and enable a more comprehensive
evaluation of the long-context reasoning capabilities of current LLMs, we
propose a new synthetic benchmark, LongReason, which is constructed by
synthesizing long-context reasoning questions from a varied set of
short-context reasoning questions through context expansion. LongReason
consists of 794 multiple-choice reasoning questions with diverse reasoning
patterns across three task categories: reading comprehension, logical
inference, and mathematical word problems. We evaluate 21 LLMs on LongReason,
revealing that most models experience significant performance drops as context
length increases. Our further analysis shows that even state-of-the-art LLMs
still have significant room for improvement in providing robust reasoning
across different tasks. We have open-sourced LongReason under
https://huggingface.co/datasets/lz1bytedance/LongReason to support the
comprehensive evaluation of LLMs' long-context reasoning capabilities.
♻ ☆ ARS: Automatic Routing Solver with Large Language Models
Real-world Vehicle Routing Problems (VRPs) are characterized by a variety of
practical constraints, making manual solver design both knowledge-intensive and
time-consuming. Although there is increasing interest in automating the design
of routing algorithms, existing research has explored only a limited array of
VRP variants and fails to adequately address the complex and prevalent
constraints encountered in real-world situations. To fill this gap, this paper
introduces RoutBench, a benchmark of 1,000 VRP variants derived from 24
attributes, for evaluating the effectiveness of automatic routing solvers in
addressing complex constraints. Along with RoutBench, we present the Automatic
Routing Solver (ARS), which employs Large Language Model (LLM) agents to
enhance a backbone algorithm framework by automatically generating
constraint-aware heuristic code, based on problem descriptions and several
representative constraints selected from a database. Our experiments show that
ARS outperforms state-of-the-art LLM-based methods and commonly used solvers,
automatically solving 91.67% of common VRPs and achieving at least a 30%
improvement across all benchmarks.
comment: Authorship is under discussion; arXiv release will follow
finalization
♻ ☆ The Limited Impact of Medical Adaptation of Large Language and Vision-Language Models EMNLP 2024
Several recent works seek to adapt general-purpose large language models
(LLMs) and vision-language models (VLMs) for medical applications through
continued pretraining on publicly available biomedical corpora. These works
typically claim that such domain-adaptive pretraining improves performance on
various downstream medical tasks, such as answering medical exam questions. In
this paper, we compare ten "medical" LLMs and two VLMs against their
corresponding base models, arriving at a different conclusion: all medical VLMs
and nearly all medical LLMs fail to consistently improve over their base models
in the zero-/few-shot prompting and supervised fine-tuning regimes for medical
question answering (QA). For instance, on clinical-note-based QA tasks in the
3-shot setting, medical LLMs outperform their base models in only 26.7% of
cases, reach a (statistical) tie in 16.7% of cases, and perform significantly
worse in the remaining 56.7% of cases. Our conclusions are based on (i)
comparing each medical model directly against its base model; (ii) optimizing
the prompts for each model separately in zero-/few-shot prompting; and (iii)
accounting for statistical uncertainty in comparisons. Our findings suggest
that state-of-the-art general-domain models may already exhibit strong medical
knowledge and reasoning capabilities, and offer recommendations to strengthen
the conclusions of future studies.
comment: Extended version of EMNLP 2024 paper arXiv:2411.04118. Includes
additional results on clinical note QA tasks and supervised fine-tuning
evaluations
♻ ☆ METAL: A Multi-Agent Framework for Chart Generation with Test-Time Scaling
Chart generation aims to generate code to produce charts satisfying the
desired visual properties, e.g., texts, layout, color, and type. It has great
potential to empower the automatic professional report generation in financial
analysis, research presentation, education, and healthcare. In this work, we
build a vision-language model (VLM) based multi-agent framework for effective
automatic chart generation. Generating high-quality charts requires both strong
visual design skills and precise coding capabilities that embed the desired
visual properties into code. Such a complex multi-modal reasoning process is
difficult for direct prompting of VLMs. To resolve these challenges, we propose
METAL, a multi-agent framework that decomposes the task of chart generation
into the iterative collaboration among specialized agents. METAL achieves 5.2%
improvement over the current best result in the chart generation task. The
METAL framework exhibits the phenomenon of test-time scaling: its performance
increases monotonically as the logarithmic computational budget grows from 512
to 8192 tokens. In addition, we find that separating different modalities
during the critique process of METAL boosts the self-correction capability of
VLMs in the multimodal context.
♻ ☆ Tool-Planner: Task Planning with Clusters across Multiple Tools ICLR 2025
Yanming Liu, Xinyue Peng, Jiannan Cao, Shi Bo, Yuwei Zhang, Xuhong Zhang, Sheng Cheng, Xun Wang, Jianwei Yin, Tianyu Du
Large language models (LLMs) have demonstrated exceptional reasoning
capabilities, enabling them to solve various complex problems. Recently, this
ability has been applied to the paradigm of tool learning. Tool learning
involves providing examples of tool usage and their corresponding functions,
allowing LLMs to formulate plans and demonstrate the process of invoking and
executing each tool. LLMs can address tasks that they cannot complete
independently, thereby enhancing their potential across different tasks.
However, this approach faces two key challenges. First, redundant error
correction leads to unstable planning and long execution time. Additionally,
designing a correct plan among multiple tools is also a challenge in tool
learning. To address these issues, we propose Tool-Planner, a task-processing
framework based on toolkits. Tool-Planner groups tools whose APIs provide the
same functionality into a toolkit and allows LLMs to implement
planning across the various toolkits. When a tool error occurs, the language
model can reselect and adjust tools based on the toolkit. Experiments show that
our approach demonstrates a high pass and win rate across different datasets
and optimizes the planning scheme for tool learning in models such as GPT-4 and
Claude 3, showcasing the potential of our method. Our code is public at
https://github.com/OceannTwT/Tool-Planner
comment: ICLR 2025 Camera Ready version
♻ ☆ Bridging Context Gaps: Leveraging Coreference Resolution for Long Contextual Understanding ICLR 2025
Yanming Liu, Xinyue Peng, Jiannan Cao, Shi Bo, Yanxin Shen, Tianyu Du, Sheng Cheng, Xun Wang, Jianwei Yin, Xuhong Zhang
Large language models (LLMs) have shown remarkable capabilities in natural
language processing; however, they still face difficulties when tasked with
understanding lengthy contexts and executing effective question answering.
These challenges often arise due to the complexity and ambiguity present in
longer texts. To enhance the performance of LLMs in such scenarios, we
introduce the Long Question Coreference Adaptation (LQCA) method. This
innovative framework focuses on coreference resolution tailored to long
contexts, allowing the model to identify and manage references effectively. The
LQCA method encompasses four key steps: resolving coreferences within
sub-documents, computing the distances between mentions, defining a
representative mention for coreference, and answering questions through mention
replacement. By processing information systematically, the framework provides
easier-to-handle partitions for LLMs, promoting better understanding.
Experimental evaluations on a range of LLMs and datasets have yielded positive
results, with notable improvements on the OpenAI-o1-mini and GPT-4o models,
highlighting the effectiveness of leveraging coreference resolution to bridge
context gaps in question answering. Our code is public at
https://github.com/OceannTwT/LQCA.
comment: ICLR 2025 camera ready version, with updated metadata
♻ ☆ Scaling up Masked Diffusion Models on Text
Masked diffusion models (MDMs) have shown promise in language modeling, yet
their scalability and effectiveness in core language tasks, such as text
generation and language understanding, remain underexplored. This paper
establishes the first scaling law for MDMs, demonstrating a scaling rate
comparable to autoregressive models (ARMs) and a relatively small compute gap.
Motivated by their scalability, we train a family of MDMs with up to 1.1
billion (B) parameters to systematically evaluate their performance against
ARMs of comparable or larger sizes. Fully leveraging the probabilistic
formulation of MDMs, we propose a simple yet effective unsupervised
classifier-free guidance that effectively exploits large-scale unpaired data,
boosting performance for conditional inference. In language understanding, the
1.1B MDM outperforms the 1.1B TinyLlama model trained on the same data across
four of eight zero-shot benchmarks. Notably, it achieves competitive math
reasoning ability with the 7B Llama-2 model on the GSM8K dataset. In text
generation, MDMs with 16 times more pre-training time offer a flexible
trade-off against ARMs with the accelerated sampling technique KV-Cache: MDMs
match ARMs in performance while being 1.4 times faster during sampling.
Moreover, MDMs address challenging tasks for ARMs by effectively handling
bidirectional reasoning and adapting to temporal shifts in data. Notably, a
1.1B MDM breaks the reverse curse encountered by much larger ARMs with
significantly more data and computation, such as 13B Llama-2 and 175B GPT-3.
Our code is available at https://github.com/ML-GSAI/SMDM.
♻ ☆ UXAgent: An LLM Agent-Based Usability Testing Framework for Web Design
Yuxuan Lu, Bingsheng Yao, Hansu Gu, Jing Huang, Jessie Wang, Laurence Li, Jiri Gesi, Qi He, Toby Jia-Jun Li, Dakuo Wang
Usability testing is a fundamental research method for user experience (UX)
researchers to evaluate a web design, yet it is challenging: study design flaws
are hard to iterate on, and study participants are hard to recruit. Recent
advances in Large Language Model-simulated Agent (LLM-Agent) research inspired
us to design UXAgent to support UX researchers in evaluating and reiterating
their usability testing study design before they conduct the real human subject
study. Our system features an LLM-Agent module and a universal browser
connector module so that UX researchers can automatically generate thousands of
simulated users to test the target website. The results are shown in
qualitative (e.g., interviewing how an agent thinks), quantitative (e.g., # of
actions), and video recording formats for UX researchers to analyze. Through a
heuristic user evaluation with five UX researchers, participants praised the
innovation of our system but also expressed concerns about the future of LLM
Agent-assisted UX study.
♻ ☆ Cross-Modal Safety Mechanism Transfer in Large Vision-Language Models ICLR 2025
Vision-language alignment in Large Vision-Language Models (LVLMs)
successfully enables LLMs to understand visual input. However, we find that
existing vision-language alignment methods fail to transfer the existing safety
mechanism for text in LLMs to vision, which leads to vulnerabilities to toxic
images. To explore the cause of this problem, we explain where and how the
safety mechanism of LVLMs operates and conduct a comparative analysis between
text and vision. We find that the hidden states at the
specific transformer layers play a crucial role in the successful activation of
safety mechanism, while the vision-language alignment at hidden states level in
current methods is insufficient. This results in a semantic shift for input
images compared to text in the hidden states, thereby misleading the safety
mechanism. To address this, we propose a novel Text-Guided vision-language
Alignment method (TGA) for LVLMs. TGA retrieves the texts related to input
vision and uses them to guide the projection of vision into the hidden states
space in LLMs. Experiments show that TGA not only successfully transfers the
safety mechanism for text in basic LLMs to vision in vision-language alignment
for LVLMs without any safety fine-tuning on the visual modality but also
maintains the general performance on various vision tasks (Safe and Good).
comment: ICLR 2025
♻ ☆ Can Generative AI Support Patients' & Caregivers' Informational Needs? Towards Task-Centric Evaluation Of AI Systems
Generative AI systems such as ChatGPT and Claude are built upon language
models that are typically evaluated for accuracy on curated benchmark datasets.
Such evaluation paradigms measure predictive and reasoning capabilities of
language models but do not assess if they can provide information that is
useful to people. In this paper, we take some initial steps in developing an
evaluation paradigm that centers human understanding and decision-making. We
study the utility of generative AI systems in supporting people in a concrete
task - making sense of clinical reports and imagery in order to make a clinical
decision. We conducted a formative need-finding study in which participants
discussed chest computed tomography (CT) scans and associated radiology reports
of a fictitious close relative with a cardiothoracic radiologist. Using
thematic analysis of the conversation between participants and medical experts,
we identified commonly occurring themes across interactions, including
clarifying medical terminology, locating the problems mentioned in the report
in the scanned image, understanding disease prognosis, discussing the next
diagnostic steps, and comparing treatment options. Based on these themes, we
evaluated two state-of-the-art generative AI systems against the radiologist's
responses. Our results reveal variability in the quality of responses generated
by the models across various themes. We highlight the importance of
patient-facing generative AI systems to accommodate a diverse range of
conversational themes, catering to the real-world informational needs of
patients.
♻ ☆ KaSA: Knowledge-Aware Singular-Value Adaptation of Large Language Models ICLR 2025
The increasing sizes of large language models (LLMs) result in significant
computational overhead and memory usage when adapting these models to specific
tasks or domains. Various parameter-efficient fine-tuning (PEFT) methods have
been devised to mitigate these challenges by training a small set of parameters
for the task-specific updates of the model weights. Among PEFT methods, LoRA
stands out for its simplicity and efficiency, inspiring the development of a
series of variants. However, LoRA and its successors disregard the knowledge
that is noisy or irrelevant to the targeted task, detrimentally impacting model
performance and leading to suboptimality. To address this limitation, we
introduce Knowledge-aware Singular-value Adaptation (KaSA), a PEFT method that
leverages singular value decomposition (SVD) with knowledge-aware singular
values to dynamically activate knowledge based on its relevance to the task at
hand. We conduct extensive experiments across a range of LLMs on tasks spanning
natural language understanding (NLU), generation (NLG), instruction following,
and commonsense reasoning. The experimental results demonstrate that KaSA
consistently outperforms FFT and 14 popular PEFT baselines across 16 benchmarks
and 4 synthetic datasets, underscoring our method's efficacy and adaptability.
The source code of our method is available at
https://github.com/juyongjiang/KaSA.
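A conceptual sketch (not the released implementation) of SVD-based adaptation
with learnable gates on the top-r singular values of a frozen weight; the
paper's knowledge-aware regularization terms are omitted:

  import torch
  import torch.nn as nn

  class SVDGatedAdapter(nn.Module):
      def __init__(self, weight, r=16):
          super().__init__()
          U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
          self.register_buffer("U", U[:, :r])          # frozen singular vectors
          self.register_buffer("S", S[:r])
          self.register_buffer("Vh", Vh[:r, :])
          self.register_buffer("W0", weight)           # frozen pretrained weight
          self.gate = nn.Parameter(torch.zeros(r))     # learnable singular-value gates

      def forward(self, x):
          delta = self.U @ torch.diag(self.S * self.gate) @ self.Vh
          return x @ (self.W0 + delta).T

  # adapter = SVDGatedAdapter(linear.weight.data.clone(), r=16)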
comment: The first three authors contributed equally to this work; Accepted by
ICLR 2025
♻ ☆ Learning Evolving Tools for Large Language Models ICLR 2025
Tool learning enables large language models (LLMs) to interact with external
tools and APIs, greatly expanding the application scope of LLMs. However, due
to the dynamic nature of external environments, these tools and APIs may become
outdated over time, preventing LLMs from correctly invoking tools. Existing
research primarily focuses on static environments and overlooks this issue,
limiting the adaptability of LLMs in real-world applications. In this paper, we
propose ToolEVO, a novel framework designed to enhance the adaptive and
reflective capabilities of LLMs against tool variability. By leveraging Monte
Carlo Tree Search, ToolEVO facilitates active exploration and interaction of
LLMs within dynamic environments, allowing for autonomous self-reflection and
self-updating of tool usage based on environmental feedback. Additionally, we
introduce ToolQA-D, a benchmark specifically designed to evaluate the impact of
tool variability. Extensive experiments demonstrate the effectiveness and
stability of our approach, highlighting the importance of adaptability to tool
variability for effective tool learning. Code:
https://github.com/Chen-GX/ToolEVO
comment: Camera ready version for ICLR 2025
♻ ☆ Exploring Rewriting Approaches for Different Conversational Tasks
Md Mehrab Tanjim, Ryan A. Rossi, Mike Rimer, Xiang Chen, Sungchul Kim, Vaishnavi Muppala, Tong Yu, Zhengmian Hu, Ritwik Sinha, Wei Zhang, Iftikhar Ahamath Burhanuddin, Franck Dernoncourt
Conversational assistants often require a question rewriting algorithm that
leverages a subset of past interactions to provide a more meaningful (accurate)
answer to the user's question or request. However, the exact rewriting approach
may often depend on the use case and application-specific tasks supported by
the conversational assistant, among other constraints. In this paper, we
systematically investigate two different approaches, denoted as rewriting and
fusion, on two fundamentally different generation tasks, including a
text-to-text generation task and a multimodal generative task that takes as
input text and generates a visualization or data table that answers the user's
question. Our results indicate that the specific rewriting or fusion approach
highly depends on the underlying use case and generative task. In particular,
we find that for a conversational question-answering assistant, the query
rewriting approach performs best, whereas for a data analysis assistant that
generates visualizations and data tables based on the user's conversation with
the assistant, the fusion approach works best. Notably, we explore two datasets
for the data analysis assistant use case, for short and long conversations, and
we find that query fusion always performs better, whereas for the
conversational text-based question-answering, the query rewrite approach
performs best.
comment: Preprint
♻ ☆ Foot-In-The-Door: A Multi-turn Jailbreak for LLMs
Ensuring AI safety is crucial as large language models become increasingly
integrated into real-world applications. A key challenge is jailbreak, where
adversarial prompts bypass built-in safeguards to elicit harmful disallowed
outputs. Inspired by psychological foot-in-the-door principles, we introduce
FITD, a novel multi-turn jailbreak method that leverages the phenomenon where
minor initial commitments lower resistance to more significant or more
unethical transgressions. Our approach progressively escalates the malicious
intent of user queries through intermediate bridge prompts and aligns the
model's responses with its own prior outputs to induce toxic responses. Extensive experimental
results on two jailbreak benchmarks demonstrate that FITD achieves an average
attack success rate of 94% across seven widely used models, outperforming
existing state-of-the-art methods. Additionally, we provide an in-depth
analysis of LLM self-corruption, highlighting vulnerabilities in current
alignment strategies and emphasizing the risks inherent in multi-turn
interactions. The code is available at
https://github.com/Jinxiaolong1129/Foot-in-the-door-Jailbreak.
comment: 19 pages, 8 figures
♻ ☆ Know You First and Be You Better: Modeling Human-Like User Simulators via Implicit Profiles
User simulators are crucial for replicating human interactions with dialogue
systems, supporting both collaborative training and automatic evaluation,
especially for large language models (LLMs). However, existing simulators often
rely solely on text utterances, missing implicit user traits such as
personality, speaking style, and goals. In contrast, persona-based methods lack
generalizability, as they depend on predefined profiles of famous individuals
or archetypes. To address these challenges, we propose User Simulator with
implicit Profiles (USP), a framework that infers implicit user profiles from
human-machine conversations and uses them to generate more personalized and
realistic dialogues. We first develop an LLM-driven extractor with a
comprehensive profile schema. Then, we refine the simulation through
conditional supervised fine-tuning and reinforcement learning with cycle
consistency, optimizing it at both the utterance and conversation levels.
Finally, we adopt a diverse profile sampler to capture the distribution of
real-world user profiles. Experimental results demonstrate that USP outperforms
strong baselines in terms of authenticity and diversity while achieving
comparable performance in consistency. Furthermore, dynamic multi-turn
evaluations based on USP strongly align with mainstream benchmarks,
demonstrating its effectiveness in real-world applications.
comment: 9 pages
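A minimal sketch of the profile-then-simulate pipeline, assuming a simplified profile schema and a generic `llm` callable; the paper's extractor, full schema, and the conditional fine-tuning and cycle-consistency RL stages are not reproduced here.

```python
import json
from typing import Callable, Dict, List

# A hypothetical, simplified profile schema; the paper's schema is richer.
PROFILE_FIELDS = ["personality", "speaking_style", "goal"]

def extract_profile(conversation: List[str],
                    llm: Callable[[str], str]) -> Dict[str, str]:
    """LLM-driven extractor: infer implicit traits from a human-machine dialogue."""
    prompt = ("Infer the user's profile from the conversation below and answer "
              f"as JSON with keys {PROFILE_FIELDS}.\n" + "\n".join(conversation))
    return json.loads(llm(prompt))

def simulate_user_turn(profile: Dict[str, str], dialogue: List[str],
                       llm: Callable[[str], str]) -> str:
    """Profile-conditioned simulator: generate the next user utterance."""
    prompt = (f"You are a user with profile {json.dumps(profile)}. "
              "Continue the conversation with one utterance.\n"
              + "\n".join(dialogue))
    return llm(prompt)

# Toy stub standing in for a real model call.
def stub(prompt: str) -> str:
    if "JSON" in prompt:
        return ('{"personality": "curious", "speaking_style": "terse", '
                '"goal": "book a flight"}')
    return "Are there any direct flights on Friday?"

profile = extract_profile(["User: hi, I need a flight",
                           "Assistant: where to?"], stub)
print(simulate_user_turn(profile, ["Assistant: where to?"], stub))
```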
♻ ☆ Self-Evolved Reward Learning for LLMs ICLR 2025
Chenghua Huang, Zhizhen Fan, Lu Wang, Fangkai Yang, Pu Zhao, Zeqi Lin, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang
Reinforcement Learning from Human Feedback (RLHF) is a crucial technique for
aligning language models with human preferences, playing a pivotal role in the
success of conversational models like GPT-4, ChatGPT, and Llama 2. A core
challenge in employing RLHF lies in training a reliable reward model (RM),
which relies on high-quality labels typically provided by human experts or
advanced AI systems. These methods can be costly and may introduce biases that
affect the language model's responses. As language models improve, human input
may become less effective in further enhancing their performance. In this
paper, we propose Self-Evolved Reward Learning (SER), a novel approach where
the RM generates additional training data to iteratively improve itself. We
conducted extensive experiments on multiple datasets such as HH-RLHF and
UltraFeedback, using models like Mistral and Llama 3, comparing SER against
various baselines. Our results demonstrate that even with limited
human-annotated data, learning from self-feedback can robustly enhance RM
performance, thereby boosting the capabilities of large language models (LLMs).
comment: 23 pages, 6 figures, Accepted to ICLR 2025
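The self-evolving loop can be illustrated with a toy stand-in reward model (a logistic regression over synthetic pairwise features); only the label-then-retrain structure mirrors the idea, not the paper's actual models, data, or confidence criterion.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_pairs(n):
    """Synthetic response pairs; the 'true' preference depends on feature sums."""
    a, b = rng.normal(size=(n, 8)), rng.normal(size=(n, 8))
    labels = (a.sum(1) > b.sum(1)).astype(int)   # 1 if response a is preferred
    return a, b, labels

def features(a, b):
    return a - b                                 # simple pairwise feature

# Small seed of "human" labels and a large unlabeled pool.
a_seed, b_seed, y_seed = make_pairs(50)
a_pool, b_pool, y_pool = make_pairs(2000)        # y_pool used only for evaluation

rm = LogisticRegression(max_iter=1000).fit(features(a_seed, b_seed), y_seed)

for it in range(3):
    # Self-labeling: the current RM scores the unlabeled pool...
    proba = rm.predict_proba(features(a_pool, b_pool))[:, 1]
    confident = (proba > 0.9) | (proba < 0.1)    # ...and keeps confident pairs.
    X_self = features(a_pool[confident], b_pool[confident])
    y_self = (proba[confident] > 0.5).astype(int)
    # Retrain on the human seed plus the self-generated labels.
    X = np.vstack([features(a_seed, b_seed), X_self])
    y = np.concatenate([y_seed, y_self])
    rm = LogisticRegression(max_iter=1000).fit(X, y)
    print(f"iteration {it}: pool accuracy "
          f"{rm.score(features(a_pool, b_pool), y_pool):.3f}")
```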
♻ ☆ A Theory for Token-Level Harmonization in Retrieval-Augmented Generation ICLR 2025
Retrieval-augmented generation (RAG) utilizes retrieved texts to enhance
large language models (LLMs). Studies show that while RAG provides valuable
external information (benefit), it may also mislead LLMs (detriment) with noisy
or incorrect retrieved texts. Although many existing methods attempt to
preserve benefit and avoid detriment, they lack a theoretical explanation for
RAG. The benefit and detriment in the next-token prediction of RAG remain a
black box that cannot be quantified or compared in an explainable manner, so
existing methods are data-driven, requiring additional utility evaluators or
post-hoc analysis. This paper takes the first step towards providing a theory
to explain and trade off the benefit and detriment in RAG. First, we model RAG
as the fusion between the distribution of the LLM's knowledge and the
distribution of the retrieved texts. Then, we formalize the trade-off between
the value of external knowledge (benefit) and its potential risk of misleading
the LLM (detriment) in the next-token prediction of RAG via the distribution
difference in this fusion. Finally, we prove that the actual effect of RAG on
a token, which is the comparison between benefit and detriment, can be
predicted without any training or access to the utility of retrieval. Based on
our theory, we propose a practical novel method,
Tok-RAG, which achieves collaborative generation between the pure LLM and RAG
at token level to preserve benefit and avoid detriment. Experiments in
real-world tasks using LLMs such as OPT, LLaMA-2, and Mistral show the
effectiveness of our method and support our theoretical findings.
comment: ICLR 2025
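Token-level collaboration between a pure LLM and RAG might look roughly like the sketch below, which compares the two next-token distributions and keeps the RAG prediction only when it seems beneficial. The confidence-gap test here is a placeholder heuristic, not the paper's theoretically derived criterion.

```python
import numpy as np

def softmax(logits):
    z = np.exp(np.asarray(logits, dtype=float) - np.max(logits))
    return z / z.sum()

def choose_token(logits_llm, logits_rag, threshold=0.1):
    """Compare the pure-LLM and RAG next-token distributions and keep the RAG
    prediction only when the shift looks beneficial (crude proxy heuristic)."""
    p_llm, p_rag = softmax(logits_llm), softmax(logits_rag)
    gain = p_rag.max() - p_llm.max()   # stand-in for benefit vs. detriment
    return int(p_rag.argmax()) if gain > threshold else int(p_llm.argmax())

# Toy usage over a 5-token vocabulary: retrieval sharpens the distribution,
# so the RAG prediction (token 1) is kept.
print(choose_token([2.0, 0.1, 0.1, 0.1, 0.1], [0.1, 3.5, 0.1, 0.1, 0.1]))
```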
♻ ☆ Multi-modal, Multi-task, Multi-criteria Automatic Evaluation with Vision Language Models
Vision-language models (VLMs) have shown impressive abilities across a range
of multi-modal tasks. However, existing metrics for evaluating the quality of
text generated by VLMs typically focus on an overall evaluation for a specific
task, such as image captioning. While the overall evaluation is essential for
any task, the criteria prioritized can differ depending on the task, making it
challenging for current metrics to adapt to multi-task scenarios. To address
this limitation, we propose HarmonicEval, a reference-free comprehensive
evaluation metric that aggregates criterion-wise scores to produce the overall
score in a bottom-up manner. Furthermore, we construct the Multi-task
Multi-criteria Human Evaluation (MMHE) dataset, which comprises 18,000 expert
human judgments across four multi-modal tasks. Our experiments demonstrate that
HarmonicEval achieves higher correlations with human judgments than
conventional metrics while providing numerical scores for each criterion.
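A bottom-up aggregation of criterion-wise scores could be sketched as follows; the harmonic mean is used purely as an illustrative choice suggested by the metric's name, and the criteria shown are hypothetical, not those defined in the paper.

```python
from statistics import harmonic_mean
from typing import Dict

def aggregate(criterion_scores: Dict[str, float]) -> float:
    """Bottom-up aggregation of criterion-wise scores into one overall score.
    The harmonic mean is an illustrative choice, not the paper's definition."""
    return harmonic_mean(list(criterion_scores.values()))

# Hypothetical criterion-wise scores a VLM judge might assign on a 1-5 scale.
scores = {"fluency": 5.0, "relevance": 4.0, "descriptiveness": 3.0}
print(round(aggregate(scores), 2))   # overall score, pulled down by weak criteria
```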
♻ ☆ Prompt-Guided Internal States for Hallucination Detection of Large Language Models
Large Language Models (LLMs) have demonstrated remarkable capabilities across
a variety of tasks in different domains. However, they sometimes generate
responses that are logically coherent but factually incorrect or misleading,
which is known as LLM hallucinations. Data-driven supervised methods train
hallucination detectors by leveraging the internal states of LLMs, but
detectors trained on specific domains often struggle to generalize well to
other domains. In this paper, we aim to enhance the cross-domain performance of
supervised detectors with only in-domain data. We propose a novel framework,
prompt-guided internal states for hallucination detection of LLMs, namely
PRISM. By using appropriate prompts to guide changes in the
truthfulness-related structure of LLMs' internal states, we make this
structure more salient and consistent across texts from different domains. We
integrated
our framework with existing hallucination detection methods and conducted
experiments on datasets from different domains. The experimental results
indicate that our framework significantly enhances the cross-domain
generalization of existing hallucination detection methods.
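One way to picture the approach: prepend a truthfulness-guiding prompt, read out the model's internal states, and train a simple probe on them. The sketch below uses GPT-2, a hypothetical guiding prompt, and a tiny illustrative label set as stand-ins; the paper's prompts, target LLMs, and detectors are more elaborate.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

# Hypothetical guiding prompt; the paper designs/selects its own prompts.
GUIDE = "Judge whether the following statement is factually correct: "

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2").eval()

def hidden_state(text: str) -> torch.Tensor:
    """Last-layer hidden state of the final token, with the guiding prompt prepended."""
    inputs = tok(GUIDE + text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1][0, -1]

# Tiny illustrative training set (1 = truthful, 0 = hallucinated).
statements = [("Paris is the capital of France.", 1),
              ("The Moon is made of cheese.", 0),
              ("Water boils at 100 degrees Celsius at sea level.", 1),
              ("Einstein invented the telephone.", 0)]
X = torch.stack([hidden_state(s) for s, _ in statements]).numpy()
y = [label for _, label in statements]

probe = LogisticRegression(max_iter=1000).fit(X, y)
print(probe.predict([hidden_state("The Sun rises in the west.").numpy()]))
```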
♻ ☆ Training on the Benchmark Is Not All You Need
The success of Large Language Models (LLMs) relies heavily on the vast amount
of data learned during the pre-training phase. The opacity of the pre-training
process and its training data makes the results of many benchmark tests
unreliable: if a model has been trained on a benchmark's test set, the
resulting evaluation seriously hinders the health of the field. In order to
automate and
efficiently test the capabilities of large language models, numerous mainstream
benchmarks adopt a multiple-choice format. As swapping the contents of
multiple-choice options does not affect the meaning of the question itself, we
propose a simple and effective data leakage detection method based on this
property. Specifically, we shuffle the contents of the options in the data to
generate the corresponding derived data sets, and then detect data leakage
based on the model's log probability distribution over the derived data sets.
If the set of log probabilities contains a value that is both the maximum and
an outlier, this indicates that the data has been leaked. Our method works
under gray-box conditions
without access to model training data or weights, effectively identifying data
leakage from benchmark test sets in model pre-training data, including both
normal scenarios and complex scenarios where options may have been shuffled
intentionally or unintentionally. Through experiments based on two LLMs and
benchmark designs, we demonstrate the effectiveness of our method. In addition,
we evaluate the degree of data leakage of 35 mainstream open-source LLMs on
four benchmark datasets and give a ranking of the leaked LLMs for each
benchmark, and we find that the Qwen family of LLMs has the highest degree of
data leakage.
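The detection procedure described above translates fairly directly into code: score every option permutation with the model and flag the item if the original ordering's log probability is both the maximum and an outlier. The scorer below is a toy stand-in for a real model, and the z-score test is one possible outlier criterion rather than the paper's exact test.

```python
from itertools import permutations
from statistics import mean, stdev
from typing import Callable, List

def is_leaked(question: str, options: List[str],
              logprob: Callable[[str], float], z_threshold: float = 2.0) -> bool:
    """Score every option permutation and flag leakage if the original ordering
    is both the maximum and an outlier among the log probabilities."""
    def render(opts):
        letters = "ABCDEFGH"
        return question + "\n" + "\n".join(
            f"{letters[i]}. {o}" for i, o in enumerate(opts))

    orderings = list(permutations(options))
    scores = [logprob(render(o)) for o in orderings]
    original, others = scores[0], scores[1:]   # identity permutation comes first
    if original < max(scores):
        return False
    z = (original - mean(others)) / (stdev(others) + 1e-9)
    return z > z_threshold

# Toy usage: a fake scorer that "remembers" only the canonical ordering.
question = "Q: Which planet is largest?"
options = ["Jupiter", "Saturn", "Earth", "Mars"]
canonical = question + "\n" + "\n".join(
    f"{l}. {o}" for l, o in zip("ABCD", options))
fake_logprob = lambda text: -5.0 if text == canonical else -30.0
print(is_leaked(question, options, fake_logprob))   # True
```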
♻ ☆ Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling ICLR 2025
Efficiently modeling sequences with infinite context length has long been a
challenging problem. Previous approaches have either suffered from quadratic
computational complexity or limited extrapolation ability in length
generalization. In this work, we present Samba, a simple hybrid architecture
that layer-wise combines Mamba, a selective State Space Model (SSM), with
Sliding Window Attention (SWA). Samba selectively compresses a given sequence
into recurrent hidden states while still maintaining the ability to precisely
recall recent memories with the attention mechanism. We scale Samba up to 3.8B
parameters with 3.2T training tokens and demonstrate that it significantly
outperforms state-of-the-art models across a variety of benchmarks. Pretrained
on sequences of 4K length, Samba shows improved perplexity at context lengths
of up to 1M tokens in a zero-shot setting. When finetuned on 4K-length
sequences, Samba
efficiently extrapolates to a 256K context length with perfect memory recall on
the Passkey Retrieval task, and exhibits superior retrieval extrapolation on
the challenging Phonebook task compared to full-attention models. As a
linear-time sequence model, Samba achieves a 3.73x higher throughput compared
to Transformers with grouped-query attention for user prompts of 128K length,
and a 3.64x speedup when generating 64K tokens with unlimited streaming. Our
code for training on open source data is publicly available at
https://github.com/microsoft/Samba.
comment: Accepted by ICLR 2025. Camera-ready Version
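A layer-wise hybrid of a recurrent, state-space-style component and sliding-window attention might be stacked roughly as below; the GRU is only a self-contained stand-in for Mamba, and the dimensions, window size, and normalization placement are illustrative rather than Samba's actual configuration.

```python
import torch
import torch.nn as nn

class SlidingWindowAttention(nn.Module):
    """Causal self-attention restricted to a fixed local window."""
    def __init__(self, dim, heads=4, window=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.window = window

    def forward(self, x):
        T = x.size(1)
        i = torch.arange(T).unsqueeze(1)
        j = torch.arange(T).unsqueeze(0)
        # Disallow future tokens and tokens outside the local window.
        mask = (j > i) | (j <= i - self.window)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out

class HybridBlock(nn.Module):
    """One layer pair: a recurrent component (GRU as a stand-in for Mamba)
    followed by sliding-window attention, each with a residual connection."""
    def __init__(self, dim, window=8):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ssm = nn.GRU(dim, dim, batch_first=True)
        self.swa = SlidingWindowAttention(dim, window=window)

    def forward(self, x):
        x = x + self.ssm(self.norm1(x))[0]   # recurrent / state-space-style path
        x = x + self.swa(self.norm2(x))      # local attention path
        return x

# Toy usage: batch of 2 sequences, length 32, hidden size 64.
model = nn.Sequential(*[HybridBlock(64) for _ in range(2)])
print(model(torch.randn(2, 32, 64)).shape)   # torch.Size([2, 32, 64])
```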
♻ ☆ Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms ICLR 2025
Zhangheng Li, Keen You, Haotian Zhang, Di Feng, Harsh Agrawal, Xiujun Li, Mohana Prasad Sathya Moorthy, Jeff Nichols, Yinfei Yang, Zhe Gan
Building a generalist model for user interface (UI) understanding is
challenging due to various foundational issues, such as platform diversity,
resolution variation, and data limitation. In this paper, we introduce
Ferret-UI 2, a multimodal large language model (MLLM) designed for universal UI
understanding across a wide range of platforms, including iPhone, Android,
iPad, Webpage, and AppleTV. Building on the foundation of Ferret-UI, Ferret-UI
2 introduces three key innovations: support for multiple platform types,
high-resolution perception through adaptive scaling, and advanced task training
data generation powered by GPT-4o with set-of-mark visual prompting. These
advancements enable Ferret-UI 2 to perform complex, user-centered interactions,
making it highly versatile and adaptable for the expanding diversity of
platform ecosystems. Extensive empirical experiments on referring, grounding,
user-centric advanced tasks (comprising 9 subtasks $\times$ 5 platforms), the
GUIDE next-action prediction dataset, and the GUI-World multi-platform
benchmark demonstrate that Ferret-UI 2 significantly outperforms Ferret-UI,
and also
shows strong cross-platform transfer capabilities.
comment: Accepted to ICLR 2025
♻ ☆ Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, Owain Evans
We present a surprising result regarding LLMs and alignment. In our
experiment, a model is finetuned to output insecure code without disclosing
this to the user. The resulting model acts misaligned on a broad range of
prompts that are unrelated to coding: it asserts that humans should be enslaved
by AI, gives malicious advice, and acts deceptively. Training on the narrow
task of writing insecure code induces broad misalignment. We call this emergent
misalignment. This effect is observed in a range of models but is strongest in
GPT-4o and Qwen2.5-Coder-32B-Instruct. Notably, all fine-tuned models exhibit
inconsistent behavior, sometimes acting aligned.
Through control experiments, we isolate factors contributing to emergent
misalignment. Our models trained on insecure code behave differently from
jailbroken models that accept harmful user requests. Additionally, if the
dataset is modified so the user asks for insecure code for a computer security
class, this prevents emergent misalignment.
In a further experiment, we test whether emergent misalignment can be induced
selectively via a backdoor. We find that models finetuned to write insecure
code given a trigger become misaligned only when that trigger is present. So
the misalignment is hidden without knowledge of the trigger.
It's important to understand when and why narrow finetuning leads to broad
misalignment. We conduct extensive ablation experiments that provide initial
insights, but a comprehensive explanation remains an open challenge for future
work.
comment: 10 pages, 9 figures