Computation and Language 72
☆ Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions
In this work, we introduce MedAgentSim, an open-source simulated clinical
environment with doctor, patient, and measurement agents designed to evaluate
and enhance LLM performance in dynamic diagnostic settings. Unlike prior
approaches, our framework requires doctor agents to actively engage with
patients through multi-turn conversations, requesting relevant medical
examinations (e.g., temperature, blood pressure, ECG) and imaging results
(e.g., MRI, X-ray) from a measurement agent to mimic the real-world diagnostic
process. Additionally, we incorporate self-improvement mechanisms that allow
models to iteratively refine their diagnostic strategies. We enhance LLM
performance in our simulated setting by integrating multi-agent discussions,
chain-of-thought reasoning, and experience-based knowledge retrieval,
facilitating progressive learning as doctor agents interact with more patients.
We also introduce an evaluation benchmark for assessing the LLM's ability to
engage in dynamic, context-aware diagnostic interactions. While MedAgentSim is
fully automated, it also supports a user-controlled mode, enabling human
interaction with either the doctor or patient agent. Comprehensive evaluations
in various simulated diagnostic scenarios demonstrate the effectiveness of our
approach. Our code, simulation tool, and benchmark are available at
\url{https://medagentsim.netlify.app/}.
comment: 14 pages, 4 figures, 61 references
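As a rough illustration of the multi-turn doctor-patient-measurement loop described above, the following Python sketch shows one way such an interaction could be orchestrated. The `Agent` class, the `ask_llm` stub, the keyword-based routing, and the stopping criterion are illustrative assumptions, not MedAgentSim's actual implementation.

```python
# Minimal sketch of a doctor/patient/measurement agent loop (illustrative only).
from dataclasses import dataclass, field

def ask_llm(system_prompt: str, history: list[str]) -> str:
    """Stub for an LLM call; a real system would query a chat model here."""
    return "FINAL DIAGNOSIS: example" if len(history) > 6 else "Please report the patient's temperature."

@dataclass
class Agent:
    role: str            # "doctor", "patient", or "measurement"
    system_prompt: str
    history: list[str] = field(default_factory=list)

    def respond(self, message: str) -> str:
        self.history.append(message)
        reply = ask_llm(self.system_prompt, self.history)
        self.history.append(reply)
        return reply

def run_consultation(doctor: Agent, patient: Agent, measurement: Agent, max_turns: int = 10) -> str:
    msg = patient.respond("Describe your chief complaint.")
    for _ in range(max_turns):
        doctor_msg = doctor.respond(msg)
        if doctor_msg.startswith("FINAL DIAGNOSIS"):
            return doctor_msg
        # Route exam/imaging requests to the measurement agent, otherwise back to the patient.
        wants_measurement = any(k in doctor_msg.lower() for k in ("temperature", "blood pressure", "ecg", "mri", "x-ray"))
        target = measurement if wants_measurement else patient
        msg = target.respond(doctor_msg)
    return "FINAL DIAGNOSIS: undetermined"

doc = Agent("doctor", "You are a doctor; ask questions, then give FINAL DIAGNOSIS.")
pat = Agent("patient", "You are a patient with a specific condition.")
meas = Agent("measurement", "You report requested vitals and imaging results.")
print(run_consultation(doc, pat, meas))
```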
☆ Think Before Recommend: Unleashing the Latent Reasoning Power for Sequential Recommendation
Sequential Recommendation (SeqRec) aims to predict the next item by capturing
sequential patterns from users' historical interactions, playing a crucial role
in many real-world recommender systems. However, existing approaches
predominantly adopt a direct forward computation paradigm, where the final
hidden state of the sequence encoder serves as the user representation. We
argue that this inference paradigm, due to its limited computational depth,
struggles to model the complex evolving nature of user preferences and lacks a
nuanced understanding of long-tail items, leading to suboptimal performance. To
address this issue, we propose \textbf{ReaRec}, the first inference-time
computing framework for recommender systems, which enhances user
representations through implicit multi-step reasoning. Specifically, ReaRec
autoregressively feeds the sequence's last hidden state into the sequential
recommender while incorporating special reasoning position embeddings to
decouple the original item encoding space from the multi-step reasoning space.
Moreover, we introduce two lightweight reasoning-based learning methods,
Ensemble Reasoning Learning (ERL) and Progressive Reasoning Learning (PRL), to
further effectively exploit ReaRec's reasoning potential. Extensive experiments
on five public real-world datasets and different SeqRec architectures
demonstrate the generality and effectiveness of our proposed ReaRec.
Remarkably, post-hoc analyses reveal that ReaRec significantly elevates the
performance ceiling of multiple sequential recommendation backbones by
approximately 30\%-50\%. Thus, we believe this work can open a new and
promising avenue for future research in inference-time computing for sequential
recommendation.
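The implicit multi-step reasoning described above can be pictured with a small PyTorch sketch: the sequence's last hidden state is repeatedly appended to the input together with a reasoning position embedding before re-encoding. The tiny model below is an illustrative assumption (architecture, dimensions, and the omission of causal masking), not the authors' ReaRec code.

```python
# Illustrative PyTorch sketch of implicit multi-step reasoning at inference time.
import torch
import torch.nn as nn

class TinySeqRec(nn.Module):
    def __init__(self, n_items=1000, d=64, n_reason_steps=3):
        super().__init__()
        self.item_emb = nn.Embedding(n_items, d)
        self.reason_pos = nn.Embedding(n_reason_steps, d)   # reasoning position embeddings
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.n_reason_steps = n_reason_steps

    def forward(self, item_ids):                             # item_ids: (B, L)
        x = self.item_emb(item_ids)                          # (B, L, d)
        for k in range(self.n_reason_steps):
            h = self.encoder(x)                              # encode current sequence
            last = h[:, -1:, :] + self.reason_pos.weight[k]  # re-feed last hidden state
            x = torch.cat([x, last], dim=1)                  # append as a "reasoning" position
        user_repr = self.encoder(x)[:, -1, :]                # final user representation
        return user_repr @ self.item_emb.weight.T            # scores over the item catalog

scores = TinySeqRec()(torch.randint(0, 1000, (2, 8)))
print(scores.shape)  # torch.Size([2, 1000])
```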
☆ QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks?
Recently, a large amount of work has focused on improving large language
models' (LLMs') performance on reasoning benchmarks such as math and logic.
However, past work has largely assumed that tasks are well-defined. In the real
world, queries to LLMs are often underspecified and only solvable by acquiring
missing information. We formalize this as a constraint satisfaction
problem (CSP) with missing variable assignments. Using a special case of this
formalism where only one necessary variable assignment is missing, we can
rigorously evaluate an LLM's ability to identify the minimal necessary question
to ask and quantify axes of difficulty for each problem. We present
QuestBench, a set of underspecified reasoning tasks solvable by asking at most
one question, which includes: (1) Logic-Q: Logical reasoning tasks with one
missing proposition, (2) Planning-Q: PDDL planning problems with initial states
that are partially-observed, (3) GSM-Q: Human-annotated grade school math
problems with one missing variable assignment, and (4) GSME-Q: a version of
GSM-Q where word problems are translated into equations by human annotators.
The LLM is tasked with selecting the correct clarification question(s) from a
list of options. While state-of-the-art models excel at GSM-Q and GSME-Q, their
accuracy is only 40-50% on Logic-Q and Planning-Q. Analysis demonstrates that
the ability to solve well-specified reasoning problems may not be sufficient
for success on our benchmark: models have difficulty identifying the right
question to ask, even when they can solve the fully specified version of the
problem. Furthermore, in the Planning-Q domain, LLMs tend not to hedge, even
when explicitly presented with the option to predict ``not sure.'' This
highlights the need for deeper investigation into models' information
acquisition capabilities.
comment: Code and dataset are available at
\url{https://github.com/google-deepmind/questbench}
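A minimal sketch of the evaluation protocol (the model selects the correct clarification question from a list of options) might look as follows. The `choose_question` stub and the toy example are hypothetical; a real run would prompt an LLM and parse its choice.

```python
# Minimal sketch of scoring a model on "pick the right clarification question" (illustrative).
def choose_question(problem: str, options: list[str]) -> int:
    """Stub: a real evaluation would prompt an LLM with the problem and the options."""
    return 0  # placeholder choice

examples = [
    {"problem": "Ann has some apples and buys 3 more. How many does she have now?",
     "options": ["How many apples did Ann start with?", "What colour are the apples?"],
     "gold": 0},
]

correct = sum(choose_question(ex["problem"], ex["options"]) == ex["gold"] for ex in examples)
print(f"accuracy: {correct / len(examples):.2f}")
```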
☆ ActionStudio: A Lightweight Framework for Data and Training of Action Models
Jianguo Zhang, Thai Hoang, Ming Zhu, Zuxin Liu, Shiyu Wang, Tulika Awalgaonkar, Akshara Prabhakar, Haolin Chen, Weiran Yao, Zhiwei Liu, Juntao Tan, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Silvio Savarese, Caiming Xiong
Action models are essential for enabling autonomous agents to perform complex
tasks. However, training large action models remains challenging due to the
diversity of agent environments and the complexity of agentic data. Despite
growing interest, existing infrastructure provides limited support for
scalable, agent-specific fine-tuning. We present ActionStudio, a lightweight
and extensible data and training framework designed for action models.
ActionStudio unifies heterogeneous agent trajectories through a standardized
format, supports diverse training paradigms including LoRA, full fine-tuning,
and distributed setups, and integrates robust preprocessing and verification
tools. We validate its effectiveness across both public and realistic industry
benchmarks, demonstrating strong performance and practical scalability. We
open-sourced code and data at https://github.com/SalesforceAIResearch/xLAM to
facilitate research in the community.
☆ Evaluating Multimodal Language Models as Visual Assistants for Visually Impaired Users
Antonia Karamolegkou, Malvina Nikandrou, Georgios Pantazopoulos, Danae Sanchez Villegas, Phillip Rust, Ruchira Dhar, Daniel Hershcovich, Anders Søgaard
This paper explores the effectiveness of Multimodal Large Language Models
(MLLMs) as assistive technologies for visually impaired individuals. We conduct
a user survey to identify adoption patterns and key challenges users face with
such technologies. Despite a high adoption rate of these models, our findings
highlight concerns related to contextual understanding, cultural sensitivity,
and complex scene understanding, particularly for individuals who may rely
solely on them for visual interpretation. Informed by these results, we collate
five user-centred tasks with image and video inputs, including a novel task on
Optical Braille Recognition. Our systematic evaluation of twelve MLLMs reveals
that further advancements are necessary to overcome limitations related to
cultural context, multilingual support, Braille reading comprehension,
assistive object recognition, and hallucinations. This work provides critical
insights into the future direction of multimodal AI for accessibility,
underscoring the need for more inclusive, robust, and trustworthy visual
assistance technologies.
☆ Historical Ink: Exploring Large Language Models for Irony Detection in 19th-Century Spanish
This study explores the use of large language models (LLMs) to enhance
datasets and improve irony detection in 19th-century Latin American newspapers.
Two strategies were employed to evaluate the efficacy of BERT and GPT-4o models
in capturing the subtle, nuanced nature of irony through both multi-class and
binary classification tasks. First, we implemented dataset enhancements focused
on enriching emotional and contextual cues; however, these showed limited
impact on historical language analysis. The second strategy, a semi-automated
annotation process, effectively addressed class imbalance and augmented the
dataset with high-quality annotations. Despite the challenges posed by the
complexity of irony, this work contributes to the advancement of sentiment
analysis through two key contributions: introducing a new historical Spanish
dataset tagged for sentiment analysis and irony detection, and proposing a
semi-automated annotation methodology where human expertise is crucial for
refining LLM results, enriched by incorporating historical and cultural
contexts as core features.
☆ Beyond Vanilla Fine-Tuning: Leveraging Multistage, Multilingual, and Domain-Specific Methods for Low-Resource Machine Translation
Fine-tuning multilingual sequence-to-sequence large language models (msLLMs)
has shown promise in developing neural machine translation (NMT) systems for
low-resource languages (LRLs). However, conventional single-stage fine-tuning
methods struggle in extremely low-resource NMT settings, where training data is
very limited. This paper contributes to artificial intelligence by proposing
two approaches for adapting msLLMs in these challenging scenarios: (1)
continual pre-training (CPT), where the msLLM is further trained with
domain-specific monolingual data to compensate for the under-representation of
LRLs, and (2) intermediate task transfer learning (ITTL), a method that
fine-tunes the msLLM with both in-domain and out-of-domain parallel data to
enhance its translation capabilities across various domains and tasks. As an
application in engineering, these methods are implemented in NMT systems for
Sinhala, Tamil, and English (six language pairs) in domain-specific, extremely
low-resource settings (datasets containing fewer than 100,000 samples). Our
experiments reveal that these approaches enhance translation performance by an
average of +1.47 bilingual evaluation understudy (BLEU) score compared to the
standard single-stage fine-tuning baseline across all translation directions.
Additionally, a multi-model ensemble further improves performance by an
additional BLEU point.
☆ Bridging the Dimensional Chasm: Uncover Layer-wise Dimensional Reduction in Transformers through Token Correlation
The geometric evolution of token representations in large language models
(LLMs) presents a fundamental paradox: while human language inherently
organizes semantic information in low-dimensional spaces ($\sim 10^1$
dimensions), modern LLMs employ high-dimensional embeddings ($\sim 10^3$
dimensions) processed through Transformer architectures. This work bridges this
conceptual gap by developing a geometric framework that tracks token dynamics
across Transformer layers. Through
layer-wise analysis of intrinsic dimensions across multiple architectures, we
reveal an expansion-contraction pattern where tokens diffuse to a "working
space" and then progressively project onto lower-dimensional submanifolds. Our
findings imply a negative correlation between the working space dimension and
parameter-sensitive performance of the LLMs, and indicates that effective
models tend to compress tokens into approximately 10-dimensional submanifolds,
closely resembling human semantic spaces. This work not only advances LLM
interpretability by reframing Transformer layers as projectors that mediate
between high-dimensional computation and low-dimensional semantics, but also
provides practical tools for model diagnostics that do not rely on
task-specific evaluations.
comment: 17 pages, 9 figures, 2 tables
☆ Exploiting Mixture-of-Experts Redundancy Unlocks Multimodal Generative Abilities
Raman Dutt, Harleen Hanspal, Guoxuan Xia, Petru-Daniel Tudosiu, Alexander Black, Yongxin Yang, Steven McDonagh, Sarah Parisot
In this work, we undertake the challenge of augmenting the existing
generative capabilities of pre-trained text-only large language models (LLMs)
with multi-modal generation capability while satisfying two core constraints:
C1: preserving the original language generative capabilities with negligible
performance degradation, and C2: adhering to a small parameter budget to learn
the new modality, ensuring scalability and efficiency. In
contrast to current approaches that add dedicated modules, thereby
significantly increasing the parameter count, we propose a method that
leverages the underutilized capacity inherent in deep models. Specifically, we
exploit the parameter redundancy within Mixture-of-Experts (MoEs) as a source
of additional capacity for learning a new modality, enabling better parameter
efficiency (C2). Moreover, we preserve the original language generation
capabilities by applying low-rank adaptation exclusively to the tokens of the
new modality (C1). Furthermore, we introduce a novel parameter initialization
scheme based on the Gromov-Wasserstein distance to improve convergence and
training stability. Through an extensive analysis of the routing mechanism, we
uncover the emergence of modality-specific pathways and decreased redundancy
within the experts that can efficiently unlock multi-modal generative
capabilities. Overall, our method can be seamlessly applied to a wide range of
contemporary LLMs, providing a new pathway for transitioning from uni-modal to
multi-modal architectures.
☆ WorkTeam: Constructing Workflows from Natural Language with Multi-Agents NAACL 2025
Workflows play a crucial role in enhancing enterprise efficiency by
orchestrating complex processes with multiple tools or components. However,
hand-crafted workflow construction requires expert knowledge, presenting
significant technical barriers. Recent advancements in Large Language Models
(LLMs) have improved the generation of workflows from natural language
instructions (aka NL2Workflow), yet existing single LLM agent-based methods
face performance degradation on complex tasks due to the need for specialized
knowledge and the strain of task-switching. To tackle these challenges, we
propose WorkTeam, a multi-agent NL2Workflow framework comprising a supervisor,
orchestrator, and filler agent, each with distinct roles that collaboratively
enhance the conversion process. As there are currently no publicly available
NL2Workflow benchmarks, we also introduce the HW-NL2Workflow dataset, which
includes 3,695 real-world business samples for training and evaluation.
Experimental results show that our approach significantly increases the success
rate of workflow construction, providing a novel and effective solution for
enterprise NL2Workflow services.
comment: Accepted in NAACL 2025 Industry Track
☆ Evaluating LLM-based Agents for Multi-Turn Conversations: A Survey
This survey examines evaluation methods for large language model (LLM)-based
agents in multi-turn conversational settings. Using a PRISMA-inspired
framework, we systematically reviewed nearly 250 scholarly sources, capturing
the state of the art from various venues of publication, and establishing a
solid foundation for our analysis. Our study offers a structured approach by
developing two interrelated taxonomy systems: one that defines \emph{what to
evaluate} and another that explains \emph{how to evaluate}. The first taxonomy
identifies key components of LLM-based agents for multi-turn conversations and
their evaluation dimensions, including task completion, response quality, user
experience, memory and context retention, as well as planning and tool
integration. These components ensure that the performance of conversational
agents is assessed in a holistic and meaningful manner. The second taxonomy
system focuses on the evaluation methodologies. It categorizes approaches into
annotation-based evaluations, automated metrics, hybrid strategies that combine
human assessments with quantitative measures, and self-judging methods
utilizing LLMs. This framework not only captures traditional metrics derived
from language understanding, such as BLEU and ROUGE scores, but also
incorporates advanced techniques that reflect the dynamic, interactive nature
of multi-turn dialogues.
☆ Scaling Laws of Scientific Discovery with AI and Robot Scientists
Pengsong Zhang, Heng Zhang, Huazhe Xu, Renjun Xu, Zhenting Wang, Cong Wang, Animesh Garg, Zhibin Li, Arash Ajoudani, Xinyu Liu
The rapid evolution of scientific inquiry highlights an urgent need for
groundbreaking methodologies that transcend the limitations of traditional
research. Conventional approaches, bogged down by manual processes and siloed
expertise, struggle to keep pace with the demands of modern discovery. We
envision an autonomous generalist scientist (AGS) system, a fusion of agentic AI
and embodied robotics, that redefines the research lifecycle. This system
promises to autonomously navigate physical and digital realms, weaving together
insights from disparate disciplines with unprecedented efficiency. By embedding
advanced AI and robot technologies into every phase, from hypothesis formulation
to peer-ready manuscripts, AGS could slash the time and resources needed for
scientific research in diverse fields. We foresee a future where scientific
discovery follows new scaling laws, driven by the proliferation and
sophistication of such systems. As these autonomous agents and robots adapt to
extreme environments and leverage a growing reservoir of knowledge, they could
spark a paradigm shift, pushing the boundaries of what's possible and ushering
in an era of relentless innovation.
comment: 22 pages, 7 figures
☆ Long-Tail Crisis in Nearest Neighbor Language Models NAACL 2025
The $k$-nearest-neighbor language model ($k$NN-LM), one of the
retrieval-augmented language models, improves the perplexity for given text by
directly accessing a large datastore built from any text data during inference.
A widely held hypothesis for the success of $k$NN-LM is that its explicit
memory, i.e., the datastore, enhances predictions for long-tail phenomena.
However, prior works have primarily shown its ability to retrieve long-tail
contexts, while the model's performance in estimating the probabilities of
long-tail target tokens during inference remains underexplored. In this paper,
we investigate the behavior of $k$NN-LM on low-frequency tokens, examining
prediction probability, retrieval accuracy, token distribution in the
datastore, and approximation error of the product quantization. Our
experimental results reveal that $k$NN-LM does not improve prediction
performance for low-frequency tokens but mainly benefits high-frequency tokens
regardless of long-tail contexts in the datastore.
comment: Accepted to NAACL 2025 Findings
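For context, the standard $k$NN-LM prediction interpolates the base LM distribution with a distribution built from retrieved datastore neighbours. The NumPy sketch below illustrates this interpolation in a simplified form; the distance metric, interpolation weight, and temperature are illustrative choices, not the paper's configuration.

```python
# Illustrative NumPy sketch of kNN-LM interpolation (not the paper's implementation).
import numpy as np

def knn_lm_probs(p_lm, query, keys, values, vocab_size, k=4, lam=0.25, temp=1.0):
    """Interpolate the base LM distribution with one built from retrieved neighbours."""
    dists = np.linalg.norm(keys - query, axis=1)           # L2 distance to datastore keys
    nn = np.argsort(dists)[:k]                              # indices of the k nearest neighbours
    weights = np.exp(-dists[nn] / temp)
    weights /= weights.sum()
    p_knn = np.zeros(vocab_size)
    for w, tok in zip(weights, values[nn]):                 # aggregate weight per target token
        p_knn[tok] += w
    return lam * p_knn + (1.0 - lam) * p_lm

vocab = 10
p_lm = np.full(vocab, 1.0 / vocab)
keys = np.random.randn(100, 8)
values = np.random.randint(0, vocab, 100)
print(knn_lm_probs(p_lm, np.random.randn(8), keys, values, vocab).sum())  # ~1.0
```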
☆ CoSIL: Software Issue Localization via LLM-Driven Code Repository Graph Searching
Large language models (LLMs) have significantly advanced autonomous software
engineering, leading to a growing number of software engineering agents that
assist developers in automatic program repair. Issue localization forms the
basis for accurate patch generation. However, because of limitations caused by
the context window length of LLMs, existing issue localization methods face
challenges in balancing concise yet effective contexts and adequately
comprehensive search spaces. In this paper, we introduce CoSIL, an LLM-driven,
simple yet powerful function-level issue localization method without training
or indexing. CoSIL reduces the search space through module call graphs,
iteratively searches the function call graph to obtain relevant contexts, and
uses context pruning to control the search direction and manage contexts
effectively. Importantly, the call graph is dynamically constructed by the LLM
during search, eliminating the need for pre-parsing. Experiment results
demonstrate that CoSIL achieves a Top-1 localization success rate of 43 percent
and 44.6 percent on SWE-bench Lite and SWE-bench Verified, respectively, using
Qwen2.5-Coder-32B, outperforming existing methods by 8.6 to 98.2 percent. When
CoSIL is applied to guide the patch generation stage, the resolved rate further
improves by 9.3 to 31.5 percent.
☆ Elite Political Discourse has Become More Toxic in Western Countries
Toxic and uncivil politics is widely seen as a growing threat to democratic
values and governance, yet our understanding of the drivers and evolution of
political incivility remains limited. Leveraging a novel dataset of nearly 18
million Twitter messages from parliamentarians in 17 countries over five years,
this paper systematically investigates whether politics internationally is
becoming more uncivil, and what the determinants of political incivility are.
Our analysis reveals a marked increase in toxic discourse among political
elites, and that this toxicity is associated with radical-right parties and parties in
opposition. Toxicity diminished markedly during the early phase of the COVID-19
pandemic and, surprisingly, during election campaigns. Furthermore, our results
indicate that posts relating to ``culture war'' topics, such as migration and
LGBTQ+ rights, are substantially more toxic than debates focused on welfare or
economic issues. These findings underscore a troubling shift in international
democracies toward an erosion of constructive democratic dialogue.
☆ EllieSQL: Cost-Efficient Text-to-SQL with Complexity-Aware Routing
Text-to-SQL automatically translates natural language queries to SQL,
allowing non-technical users to retrieve data from databases without
specialized SQL knowledge. Despite the success of advanced LLM-based
Text-to-SQL approaches on leaderboards, their unsustainable computational
costs--often overlooked--stand as the "elephant in the room" in current
leaderboard-driven research, limiting their economic practicability for
real-world deployment and widespread adoption. To tackle this, we exploratively
propose EllieSQL, a complexity-aware routing framework that assigns queries to
suitable SQL generation pipelines based on estimated complexity. We investigate
multiple routers to direct simple queries to efficient approaches while
reserving computationally intensive methods for complex cases. Drawing from
economics, we introduce the Token Elasticity of Performance (TEP) metric,
capturing cost-efficiency by quantifying the responsiveness of performance
gains relative to token investment in SQL generation. Experiments show that
compared to always using the most advanced methods in our study, EllieSQL with
the Qwen2.5-0.5B-DPO router reduces token use by over 40% without compromising
performance on Bird development set, achieving more than a 2x boost in TEP over
non-routing approaches. This not only advances the pursuit of cost-efficient
Text-to-SQL but also invites the community to weigh resource efficiency
alongside performance, contributing to progress in sustainable Text-to-SQL.
comment: 19 pages, 8 figures, 3 tables
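The TEP metric is defined in the paper; as a rough intuition, an elasticity-style quantity relates the relative performance gain to the relative increase in token spend. The sketch below is only one plausible reading of such a metric, not the authors' exact formula.

```python
# One plausible reading of an elasticity-style metric: relative performance gain per
# relative increase in token spend. The exact TEP definition is given in the paper;
# this sketch only illustrates the general idea.
def elasticity(perf_base, perf_new, tokens_base, tokens_new):
    d_perf = (perf_new - perf_base) / perf_base
    d_tokens = (tokens_new - tokens_base) / tokens_base
    return d_perf / d_tokens if d_tokens != 0 else float("inf")

# Toy numbers: +10% relative accuracy for +50% relative token cost -> elasticity 0.2.
print(elasticity(perf_base=0.50, perf_new=0.55, tokens_base=1_000, tokens_new=1_500))
```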
☆ Negation: A Pink Elephant in the Large Language Models' Room?
Negations are key to determining sentence meaning, making them essential for
logical reasoning. Despite their importance, negations pose a substantial
challenge for large language models (LLMs) and remain underexplored.
We construct two multilingual natural language inference (NLI) datasets with
\textit{paired} examples differing in negation. We investigate how model size
and language impact models' ability to handle negation correctly by evaluating
popular LLMs.
Contrary to previous work, we show that increasing the model size
consistently improves the models' ability to handle negations. Furthermore, we
find that both the models' reasoning accuracy and robustness to negation are
language-dependent and that the length and explicitness of the premise have a
greater impact on robustness than language.
Our datasets can facilitate further research and improvements of language
model reasoning in multilingual settings.
☆ Why Stop at One Error? Benchmarking LLMs as Data Science Code Debuggers for Multi-Hop and Multi-Bug Errors
LLMs are transforming software development, yet current code generation and
code repair benchmarks mainly assess syntactic and functional correctness in
simple, single-error cases. LLMs' capabilities to autonomously find and fix
runtime logical errors in complex data science code remain largely unexplored.
To address this gap, we introduce DSDBench: the Data Science Debugging
Benchmark, the first benchmark for systematic evaluation of LLMs on multi-hop
error tracing and multi-bug detection in data science code debugging. DSDBench
adapts datasets from existing data science task benchmarks, such as DABench and
MatPlotBench, featuring realistic data science debugging tasks with
automatically synthesized multi-hop, multi-bug code snippets. DSDBench includes
1,117 annotated samples with 741 cause-effect error pairs and runtime error
messages. Evaluations of state-of-the-art LLMs on DSDBench show significant
performance gaps, highlighting challenges in debugging logical runtime errors
in data science code. DSDBench offers a crucial resource to evaluate and
improve LLMs' debugging and reasoning capabilities, enabling more reliable
AI-assisted data science in the future. DSDBench is publicly available at
https://github.com/KevinCL16/DSDBench.
comment: Work in progress
☆ Spend Your Budget Wisely: Towards an Intelligent Distribution of the Privacy Budget in Differentially Private Text Rewriting SP
The task of $\textit{Differentially Private Text Rewriting}$ is a class of
text privatization techniques in which (sensitive) input textual documents are
$\textit{rewritten}$ under Differential Privacy (DP) guarantees. The motivation
behind such methods is to hide both explicit and implicit identifiers that
could be contained in text, while still retaining the semantic meaning of the
original text, thus preserving utility. Recent years have seen an uptick in
research output in this field, offering a diverse array of word-, sentence-,
and document-level DP rewriting methods. Common to these methods is the
selection of a privacy budget (i.e., the $\varepsilon$ parameter), which
governs the degree to which a text is privatized. One major limitation of
previous works, stemming directly from the unique structure of language itself,
is the lack of consideration of $\textit{where}$ the privacy budget should be
allocated, as not all aspects of language, and therefore text, are equally
sensitive or personal. In this work, we are the first to address this
shortcoming, asking the question of how a given privacy budget can be
intelligently and sensibly distributed amongst a target document. We construct
and evaluate a toolkit of linguistics- and NLP-based methods used to allocate a
privacy budget to constituent tokens in a text document. In a series of privacy
and utility experiments, we empirically demonstrate that given the same privacy
budget, intelligent distribution leads to higher privacy levels and more
positive trade-offs than a naive distribution of $\varepsilon$. Our work
highlights the intricacies of text privatization with DP, and furthermore, it
calls for further work on finding more efficient ways to maximize the
privatization benefits offered by DP in text rewriting.
comment: 14 pages, 1 figure, 6 tables. Accepted to CODASPY 2025
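As a rough illustration of budget distribution, the sketch below splits a total budget $\varepsilon$ across tokens in proportion to a sensitivity score, so that the per-token budgets sum to $\varepsilon$ under basic sequential composition. The scoring function and token-level granularity are assumptions, not the paper's toolkit.

```python
# Minimal sketch: distribute a total privacy budget across tokens in proportion to a
# (hypothetical) sensitivity score; per-token budgets sum to the total under basic
# sequential composition. The scoring function is an assumption, not the paper's method.
def allocate_budget(tokens, total_eps, sensitivity):
    scores = [sensitivity(t) for t in tokens]
    z = sum(scores)
    return {t: total_eps * s / z for t, s in zip(tokens, scores)}

# Toy sensitivity: name-like tokens receive a smaller epsilon (i.e., stronger privatization).
naive = allocate_budget(["I", "met", "Alice", "in", "Paris"], total_eps=10.0,
                        sensitivity=lambda t: 1.0)
smart = allocate_budget(["I", "met", "Alice", "in", "Paris"], total_eps=10.0,
                        sensitivity=lambda t: 0.5 if t.istitle() and len(t) > 1 else 2.0)
print(naive, smart, sep="\n")
```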
☆ Supposedly Equivalent Facts That Aren't? Entity Frequency in Pre-training Induces Asymmetry in LLMs
Yuan He, Bailan He, Zifeng Ding, Alisia Lupidi, Yuqicheng Zhu, Shuo Chen, Caiqi Zhang, Jiaoyan Chen, Yunpu Ma, Volker Tresp, Ian Horrocks
Understanding and mitigating hallucinations in Large Language Models (LLMs)
is crucial for ensuring reliable content generation. While previous research
has primarily focused on "when" LLMs hallucinate, our work explains "why" and
directly links model behaviour to the pre-training data that forms their prior
knowledge. Specifically, we demonstrate that an asymmetry exists in the
recognition of logically equivalent facts, which can be attributed to frequency
discrepancies of entities appearing as subjects versus objects. Given that most
pre-training datasets are inaccessible, we leverage the fully open-source OLMo
series by indexing its Dolma dataset to estimate entity frequencies. Using
relational facts (represented as triples) from Wikidata5M, we construct probing
datasets to isolate this effect. Our experiments reveal that facts with a
high-frequency subject and a low-frequency object are better recognised than
their inverse, despite their logical equivalence. The pattern reverses in
low-to-high frequency settings, and no statistically significant asymmetry
emerges when both entities are high-frequency. These findings highlight the
influential role of pre-training data in shaping model predictions and provide
insights for inferring the characteristics of pre-training data in closed or
partially closed LLMs.
☆ Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions
Large Language Models (LLMs) have shown remarkable capabilities across
various tasks, but their deployment in high-stakes domains requires consistent
performance across multiple interaction rounds. This paper introduces a
comprehensive framework for evaluating and improving LLM response consistency,
making three key contributions. First, we propose a novel Position-Weighted
Consistency (PWC) score that captures both the importance of early-stage
stability and recovery patterns in multi-turn interactions. Second, we present
a carefully curated benchmark dataset spanning diverse domains and difficulty
levels, specifically designed to evaluate LLM consistency under various
challenging follow-up scenarios. Third, we introduce Confidence-Aware Response
Generation (CARG), a framework that significantly improves response stability
by incorporating model confidence signals into the generation process.
Empirical results demonstrate that CARG significantly improves response
stability without sacrificing accuracy, underscoring its potential for reliable
LLM deployment in critical applications.
comment: 8 pages, 5 figures
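The exact PWC formula is given in the paper; the sketch below only conveys the general idea of a position-weighted average of per-turn consistency indicators in which earlier turns carry more weight. The decay weighting is an assumption.

```python
# Illustrative position-weighted consistency: earlier turns are weighted more heavily.
# The decay factor and binary per-turn flags are assumptions, not the paper's formula.
def position_weighted_consistency(consistent_flags, decay=0.8):
    weights = [decay ** i for i in range(len(consistent_flags))]
    return sum(w * c for w, c in zip(weights, consistent_flags)) / sum(weights)

# Turn-level flags: 1 if the model's answer stayed consistent with its initial answer.
print(position_weighted_consistency([1, 1, 0, 1]))  # early consistency counts more
```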
☆ SKDU at De-Factify 4.0: Natural Language Features for AI-Generated Text-Detection AAAI
The rapid advancement of large language models (LLMs) has introduced new
challenges in distinguishing human-written text from AI-generated content. In
this work, we explored a pipelined approach for AI-generated text detection
that includes a feature extraction step (i.e. prompt-based rewriting features
inspired by RAIDAR and content-based features derived from the NELA toolkit)
followed by a classification module. Comprehensive experiments were conducted
on the Defactify4.0 dataset, evaluating two tasks: binary classification to
differentiate human-written and AI-generated text, and multi-class
classification to identify the specific generative model used to generate the
input text. Our findings reveal that NELA features significantly outperform
RAIDAR features in both tasks, demonstrating their ability to capture nuanced
linguistic, stylistic, and content-based differences. Combining RAIDAR and NELA
features provided minimal improvement, highlighting the redundancy introduced
by less discriminative features. Among the classifiers tested, XGBoost emerged
as the most effective, leveraging the rich feature sets to achieve high
accuracy and generalisation.
comment: De-Factify 4.0 Workshop at the 39th AAAI Conference on Artificial
Intelligence (AAAI 2025)
☆ A Refined Analysis of Massive Activations in LLMs
Motivated in part by their relevance for low-precision training and
quantization, massive activations in large language models (LLMs) have recently
emerged as a topic of interest. However, existing analyses are limited in
scope, and generalizability across architectures is unclear. This paper helps
address some of these gaps by conducting an analysis of massive activations
across a broad range of LLMs, including both GLU-based and non-GLU-based
architectures. Our findings challenge several prior assumptions, most
importantly: (1) not all massive activations are detrimental, i.e. suppressing
them does not lead to an explosion of perplexity or a collapse in downstream
task performance; (2) proposed mitigation strategies such as Attention KV bias
are model-specific and ineffective in certain cases. We consequently
investigate novel hybrid mitigation strategies; in particular pairing Target
Variance Rescaling (TVR) with Attention KV bias or Dynamic Tanh (DyT)
successfully balances the mitigation of massive activations with preserved
downstream model performance in the scenarios we investigated. Our code is
available at: https://github.com/bluorion-com/refine_massive_activations.
☆ Preference-based Learning with Retrieval Augmented Generation for Conversational Question Answering WWW 2025
Conversational Question Answering (ConvQA) involves multiple subtasks: (i)
understanding incomplete questions in their context, (ii) retrieving relevant
information, and (iii) generating answers. This work presents PRAISE, a
pipeline-based approach for ConvQA that trains LLM adapters for each of the
three subtasks. As labeled training data for individual subtasks is unavailable
in practice, PRAISE learns from its own generations using the final answering
performance as feedback signal without human intervention and treats
intermediate information, like relevant evidence, as weakly labeled data. We
apply Direct Preference Optimization by contrasting successful and unsuccessful
samples for each subtask. In our experiments, we show the effectiveness of this
training paradigm: PRAISE shows improvements per subtask and achieves new
state-of-the-art performance on a popular ConvQA benchmark, with a 15.5
percentage point increase in precision over baselines.
comment: WWW 2025 Short Paper, 5 pages
☆ MultiClaimNet: A Massively Multilingual Dataset of Fact-Checked Claim Clusters
In the context of fact-checking, claims are often repeated across various
platforms and in different languages, which can benefit from a process that
reduces this redundancy. While retrieving previously fact-checked claims has
been investigated as a solution, the growing number of unverified claims and
expanding size of fact-checked databases calls for alternative, more efficient
solutions. A promising solution is to group claims that discuss the same
underlying facts into clusters to improve claim retrieval and validation.
However, research on claim clustering is hindered by the lack of suitable
datasets. To bridge this gap, we introduce \textit{MultiClaimNet}, a collection
of three multilingual claim cluster datasets containing claims in 86 languages
across diverse topics. Claim clusters are formed automatically from
claim-matching pairs with limited manual intervention. We leverage two existing
claim-matching datasets to form the smaller datasets within
\textit{MultiClaimNet}. To build the larger dataset, we propose and validate an
approach involving retrieval of approximate nearest neighbors to form candidate
claim pairs and an automated annotation of claim similarity using large
language models. This larger dataset contains 85.3K fact-checked claims written
in 78 languages. We further conduct extensive experiments using various
clustering techniques and sentence embedding models to establish baseline
performance. Our datasets and findings provide a strong foundation for scalable
claim clustering, contributing to efficient fact-checking pipelines.
☆ CFiCS: Graph-Based Classification of Common Factors and Microcounseling Skills
Common factors and microcounseling skills are critical to the effectiveness
of psychotherapy. Understanding and measuring these elements provides valuable
insights into therapeutic processes and outcomes. However, automatic
identification of these change principles from textual data remains challenging
due to the nuanced and context-dependent nature of therapeutic dialogue. This
paper introduces CFiCS, a hierarchical classification framework integrating
graph machine learning with pretrained contextual embeddings. We represent
common factors, intervention concepts, and microcounseling skills as a
heterogeneous graph, where textual information from ClinicalBERT enriches each
node. This structure captures both the hierarchical relationships (e.g.,
skill-level nodes linking to broad factors) and the semantic properties of
therapeutic concepts. By leveraging graph neural networks, CFiCS learns
inductive node embeddings that generalize to unseen text samples lacking
explicit connections. Our results demonstrate that integrating ClinicalBERT
node features and graph structure significantly improves classification
performance, especially in fine-grained skill prediction. CFiCS achieves
substantial gains in both micro and macro F1 scores across all tasks compared
to baselines, including random forests, BERT-based multi-task models, and
graph-based methods.
comment: 10 pages, 3 figures, 2 tables
☆ Process Reward Modeling with Entropy-Driven Uncertainty
Lang Cao, Renhong Chen, Yingtian Zou, Chao Peng, Wu Ning, Huacong Xu, Qian Chen, Yuxian Wang, Peishuo Su, Mofan Peng, Zijie Chen, Yitong Li
This paper presents the Entropy-Driven Unified Process Reward Model
(EDU-PRM), a novel framework that approximates state-of-the-art performance in
process supervision while drastically reducing training costs. EDU-PRM
introduces an entropy-guided dynamic step partitioning mechanism, using logit
distribution entropy to dynamically pinpoint high-uncertainty regions during
token generation. This self-assessment capability enables precise
step-level feedback without manual fine-grained annotation, addressing a
critical challenge in process supervision. Experiments on the Qwen2.5-72B model
with only 7,500 EDU-PRM-generated training queries demonstrate accuracy closely
approximating the full Qwen2.5-72B-PRM (71.1% vs. 71.6%), achieving a 98%
reduction in query cost compared to prior methods. This work establishes
EDU-PRM as an efficient approach for scalable process reward model training.
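A minimal sketch of entropy-guided step partitioning: compute the entropy of the next-token distribution at each position and open a new step wherever the entropy crosses a threshold. The threshold and toy distributions below are assumptions, not EDU-PRM's actual mechanism.

```python
# Illustrative sketch of entropy-guided step partitioning over next-token distributions.
import numpy as np

def entropy(p):
    p = np.asarray(p)
    return float(-(p * np.log(p + 1e-12)).sum())

def partition_steps(token_probs, threshold=1.0):
    """token_probs: per-position next-token distributions; returns high-entropy boundary indices."""
    return [i for i, p in enumerate(token_probs) if entropy(p) > threshold]

probs = [np.array([0.97, 0.01, 0.01, 0.01]),   # confident -> low entropy
         np.array([0.4, 0.3, 0.2, 0.1]),       # uncertain -> high entropy, step boundary
         np.array([0.9, 0.05, 0.03, 0.02])]
print(partition_steps(probs))  # [1]
```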
☆ Learning to Instruct for Visual Instruction Tuning
We propose LIT, an advancement of visual instruction tuning (VIT). While VIT
equips Multimodal LLMs (MLLMs) with promising multimodal capabilities, the
current design choices for VIT often result in overfitting and shortcut
learning, potentially degrading performance. This gap arises from an
overemphasis on instruction-following abilities, while neglecting the proactive
understanding of visual information. Inspired by this, LIT adopts a simple yet
effective approach by incorporating the loss function into both the instruction
and response sequences. It seamlessly expands the training data, and
regularizes the MLLMs against overly relying on language priors. Based on this
merit, LIT achieves a significant relative improvement of up to 9% on
comprehensive multimodal benchmarks, requiring no additional training data and
incurring negligible computational overhead. Surprisingly, LIT attains
exceptional fundamental visual capabilities, yielding up to an 18% improvement
in captioning performance, while simultaneously alleviating hallucination in
MLLMs.
comment: 16 pages, 10 figures
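The core change, placing the language-modeling loss on instruction tokens as well as response tokens, can be illustrated with a small PyTorch sketch of the loss masking. The toy tensors below are assumptions; this is not LIT's training code.

```python
# Sketch of the loss-masking difference: standard visual instruction tuning supervises
# only response tokens, while the approach above also places the loss on instruction tokens.
import torch
import torch.nn.functional as F

logits = torch.randn(1, 6, 100)                      # (batch, seq_len, vocab)
tokens = torch.randint(0, 100, (1, 6))
is_instruction = torch.tensor([[1, 1, 1, 0, 0, 0]])  # first 3 tokens form the instruction

def lm_loss(logits, tokens, mask):
    """Next-token cross-entropy, averaged over positions where mask == 1."""
    logp = F.log_softmax(logits[:, :-1], dim=-1)
    nll = -logp.gather(-1, tokens[:, 1:, None]).squeeze(-1)
    m = mask[:, 1:].float()
    return (nll * m).sum() / m.sum()

response_only = lm_loss(logits, tokens, (1 - is_instruction))   # conventional VIT masking
instr_and_resp = lm_loss(logits, tokens, torch.ones_like(tokens))  # loss on both sequences
print(response_only.item(), instr_and_resp.item())
```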
☆ EdgeInfinite: A Memory-Efficient Infinite-Context Transformer for Edge Devices
Transformer-based large language models (LLMs) encounter challenges in
processing long sequences on edge devices due to the quadratic complexity of
attention mechanisms and growing memory demands from Key-Value (KV) cache.
Existing KV cache optimizations struggle with irreversible token eviction in
long-output tasks, while alternative sequence modeling architectures prove
costly to adopt within established Transformer infrastructure. We present
EdgeInfinite, a memory-efficient solution for infinite contexts that integrates
compressed memory into Transformer-based LLMs through a trainable memory-gating
module. This approach maintains full compatibility with standard Transformer
architectures, requiring fine-tuning of only a small fraction of the parameters,
and enables selective activation of the memory-gating module for routing between
long- and short-context tasks. Experimental results show that EdgeInfinite
achieves comparable performance to a baseline Transformer-based LLM on long-context
benchmarks while optimizing memory consumption and time to first token.
comment: 8 pages, 3 figures
☆ Tokenization of Gaze Data
A considerable part of the performance of today's large language models
(LLMs) and multimodal large language models (MLLMs) depends on their
tokenization strategies. While tokenizers are extensively researched for
textual and visual input, there is no research on tokenization strategies for
gaze data, owing to its distinct nature. However, a corresponding tokenization strategy
would allow using the vision capabilities of pre-trained MLLMs for gaze data,
for example, through fine-tuning.
In this paper, we aim to close this research gap by analyzing five different
tokenizers for gaze data on three different datasets for the forecasting and
generation of gaze data through LLMs. We evaluate the
tokenizers regarding their reconstruction and compression abilities. Further,
we train an LLM for each tokenization strategy, measuring its generative and
predictive performance. Overall, we found that a quantile tokenizer outperforms
all others in predicting the gaze positions and k-means is best when predicting
gaze velocities.
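As an illustration of the best-performing strategy for gaze positions, the sketch below implements a simple quantile tokenizer: coordinates are mapped to the index of the quantile bin they fall into, fitted on training data. The bin count, data, and decoding rule are assumptions, not the paper's setup.

```python
# Minimal sketch of a quantile tokenizer for (normalized) gaze positions.
import numpy as np

class QuantileTokenizer:
    def __init__(self, n_bins=16):
        self.n_bins = n_bins
        self.edges = None

    def fit(self, values):
        qs = np.linspace(0, 1, self.n_bins + 1)[1:-1]
        self.edges = np.quantile(values, qs)        # interior bin edges from training data
        return self

    def encode(self, values):
        return np.digitize(values, self.edges)      # token id in [0, n_bins - 1]

    def decode(self, tokens):
        centers = np.concatenate([[self.edges[0]],
                                  (self.edges[:-1] + self.edges[1:]) / 2,
                                  [self.edges[-1]]])
        return centers[tokens]                       # rough reconstruction of positions

x = np.random.rand(1000)                             # e.g. normalized horizontal gaze positions
tok = QuantileTokenizer(n_bins=8).fit(x)
print(tok.encode(x[:5]), tok.decode(tok.encode(x[:5])))
```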
☆ FRASE: Structured Representations for Generalizable SPARQL Query Generation
Translating natural language questions into SPARQL queries enables Knowledge
Base querying for factual and up-to-date responses. However, existing datasets
for this task are predominantly template-based, leading models to learn
superficial mappings between question and query templates rather than
developing true generalization capabilities. As a result, models struggle when
encountering naturally phrased, template-free questions. This paper introduces
FRASE (FRAme-based Semantic Enhancement), a novel approach that leverages Frame
Semantic Role Labeling (FSRL) to address this limitation. We also present
LC-QuAD 3.0, a new dataset derived from LC-QuAD 2.0, in which each question is
enriched using FRASE through frame detection and the mapping of frame elements
to their arguments. We evaluate the impact of this approach through extensive
experiments on recent large language models (LLMs) under different fine-tuning
configurations. Our results demonstrate that integrating frame-based structured
representations consistently improves SPARQL generation performance,
particularly in challenging generalization scenarios when test questions
feature unseen templates (unknown template splits) and when they are all
naturally phrased (reformulated questions).
☆ Convolutional optimization with convex kernel and power lift
We focus on establishing the foundational paradigm of a novel optimization
theory based on convolution with convex kernels. Our goal is to devise a
morally deterministic model of locating the global optima of an arbitrary
function, which is distinguished from most commonly used statistical models.
Limited preliminary numerical results are provided to test the efficiency of
some specific algorithms derived from our paradigm, which we hope will stimulate
further practical interest.
☆ REMAC: Self-Reflective and Self-Evolving Multi-Agent Collaboration for Long-Horizon Robot Manipulation
Vision-language models (VLMs) have demonstrated remarkable capabilities in
robotic planning, particularly for long-horizon tasks that require a holistic
understanding of the environment for task decomposition. Existing methods
typically rely on prior environmental knowledge or carefully designed
task-specific prompts, making them struggle with dynamic scene changes or
unexpected task conditions, e.g., a robot attempting to put a carrot in the
microwave but finding the door closed. Such challenges underscore two
critical issues: adaptability and efficiency. To address them, in this work, we
propose an adaptive multi-agent planning framework, termed REMAC, that enables
efficient, scene-agnostic multi-robot long-horizon task planning and execution
through continuous reflection and self-evolution. REMAC incorporates two key
modules: a self-reflection module performing pre-condition and post-condition
checks in the loop to evaluate progress and refine plans, and a self-evolvement
module dynamically adapting plans based on scene-specific reasoning. It offers
several appealing benefits: 1) Robots can initially explore and reason about
the environment without complex prompt design. 2) Robots can keep reflecting on
potential planning errors and adapting the plan based on task-specific
insights. 3) After iterations, a robot can call another one to coordinate tasks
in parallel, maximizing the task execution efficiency. To validate REMAC's
effectiveness, we build a multi-agent environment for long-horizon robot
manipulation and navigation based on RoboCasa, featuring 4 task categories with
27 task styles and 50+ different objects. Based on it, we further benchmark
state-of-the-art reasoning models, including DeepSeek-R1, o3-mini, QwQ, and
Grok3, demonstrating REMAC's superiority by boosting average success rates by
40% and execution efficiency by 52.7% over the single robot baseline.
☆ Beyond Single-Sentence Prompts: Upgrading Value Alignment Benchmarks with Dialogues and Stories
Evaluating the value alignment of large language models (LLMs) has
traditionally relied on single-sentence adversarial prompts, which directly
probe models with ethically sensitive or controversial questions. However, with
the rapid advancements in AI safety techniques, models have become increasingly
adept at circumventing these straightforward tests, limiting their
effectiveness in revealing underlying biases and ethical stances. To address
this limitation, we propose an upgraded value alignment benchmark that moves
beyond single-sentence prompts by incorporating multi-turn dialogues and
narrative-based scenarios. This approach enhances the stealth and adversarial
nature of the evaluation, making it more robust against superficial safeguards
implemented in modern LLMs. We design and implement a dataset that includes
conversational traps and ethically ambiguous storytelling, systematically
assessing LLMs' responses in more nuanced and context-rich settings.
Experimental results demonstrate that this enhanced methodology can effectively
expose latent biases that remain undetected in traditional single-shot
evaluations. Our findings highlight the necessity of contextual and dynamic
testing for value alignment in LLMs, paving the way for more sophisticated and
realistic assessments of AI ethics and safety.
☆ Few-Shot Graph Out-of-Distribution Detection with LLMs
Existing methods for graph out-of-distribution (OOD) detection typically
depend on training graph neural network (GNN) classifiers using a substantial
amount of labeled in-distribution (ID) data. However, acquiring high-quality
labeled nodes in text-attributed graphs (TAGs) is challenging and costly due to
their complex textual and structural characteristics. Large language models
(LLMs), known for their powerful zero-shot capabilities in textual tasks, show
promise but struggle to naturally capture the critical structural information
inherent to TAGs, limiting their direct effectiveness.
To address these challenges, we propose LLM-GOOD, a general framework that
effectively combines the strengths of LLMs and GNNs to enhance data efficiency
in graph OOD detection. Specifically, we first leverage LLMs' strong zero-shot
capabilities to filter out likely OOD nodes, significantly reducing the human
annotation burden. To minimize the usage and cost of the LLM, we employ it only
to annotate a small subset of unlabeled nodes. We then train a lightweight GNN
filter using these noisy labels, enabling efficient predictions of ID status
for all other unlabeled nodes by leveraging both textual and structural
information. After obtaining node embeddings from the GNN filter, we can apply
informativeness-based methods to select the most valuable nodes for precise
human annotation. Finally, we train the target ID classifier using these
accurately annotated ID nodes. Extensive experiments on four real-world TAG
datasets demonstrate that LLM-GOOD significantly reduces human annotation costs
and outperforms state-of-the-art baselines in terms of both ID classification
accuracy and OOD detection performance.
☆ Leveraging LLMs for Predicting Unknown Diagnoses from Clinical Notes
Electronic Health Records (EHRs) often lack explicit links between
medications and diagnoses, making clinical decision-making and research more
difficult. Even when links exist, diagnosis lists may be incomplete, especially
during early patient visits. Discharge summaries tend to provide more complete
information, which can help infer accurate diagnoses, especially with the help
of large language models (LLMs). This study investigates whether LLMs can
predict implicitly mentioned diagnoses from clinical notes and link them to
corresponding medications. We address two research questions: (1) Does majority
voting across diverse LLM configurations outperform the best single
configuration in diagnosis prediction? (2) How sensitive is majority voting
accuracy to LLM hyperparameters such as temperature, top-p, and summary length?
To evaluate, we created a new dataset of 240 expert-annotated
medication-diagnosis pairs from 20 MIMIC-IV notes. Using GPT-3.5 Turbo, we ran
18 prompting configurations across short and long summary lengths, generating
8568 test cases. Results show that majority voting achieved 75 percent
accuracy, outperforming the best single configuration at 66 percent. No single
hyperparameter setting dominated, but combining deterministic, balanced, and
exploratory strategies improved performance. Shorter summaries generally led to
higher accuracy. In conclusion, ensemble-style majority voting with diverse LLM
configurations improves diagnosis prediction in EHRs and offers a promising
method to link medications and diagnoses in clinical texts.
comment: 19 pages, 3 figures, 5 tables
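A minimal sketch of ensemble-style majority voting over LLM configurations might look as follows; the prediction stub stands in for actual GPT-3.5 Turbo calls under different temperature, top-p, and summary-length settings, and the configuration grid is illustrative.

```python
# Sketch of majority voting over diverse LLM configurations (illustrative only).
from collections import Counter

def predict_diagnosis(note: str, temperature: float, top_p: float, summary: str) -> str:
    """Stub: a real pipeline would prompt an LLM with the note under this configuration."""
    return "type 2 diabetes"

configs = [{"temperature": t, "top_p": p, "summary": s}
           for t in (0.0, 0.7, 1.0) for p in (0.9, 1.0) for s in ("short", "long")]

def majority_vote(note: str) -> str:
    votes = Counter(predict_diagnosis(note, **cfg) for cfg in configs)
    return votes.most_common(1)[0][0]

print(majority_vote("Patient on metformin; discharge summary notes elevated HbA1c."))
```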
☆ Penrose Tiled Low-Rank Compression and Section-Wise Q&A Fine-Tuning: A General Framework for Domain-Specific Large Language Model Adaptation
Large language models (LLMs) hold great promise for specialized scientific
domains such as materials science, yet adapting them efficiently and accurately
to domain-specific knowledge remains challenging due to limited data and high
knowledge density. We propose a two-stage framework that combines structured
model compression with a scientific fine-tuning regimen to address this
challenge. In the compression stage, we decompose the LLM's weight matrices
into local low-rank "rank blocks" and arrange these blocks in a Penrose-like
non-periodic tiling pattern. Each block is then compacted via spectral
transformations (e.g., discrete cosine or Fourier transforms), and a
Kullback-Leibler (KL) divergence-based alignment loss preserves the
distributional similarity between the compressed model's representations and
those of the original full model. In the adaptation stage, the compressed model
is further tuned using a human-like scientific reading protocol: it processes
technical materials science documents section by section, engaging in a
structured question-and-answer routine for each section. This section-wise Q&A
fine-tuning strategy extracts explicit reasoning traces and gradually injects
domain knowledge, while minimizing catastrophic forgetting of the model's
general language capabilities. By balancing efficient compression with targeted
adaptation, our two-stage approach enables precise specialization of LLMs to
high-value domains under data-scarce conditions. We present this principled yet
exploratory pipeline and outline its potential for advancing materials science
knowledge integration, laying the groundwork for comprehensive empirical
evaluation in future work.
☆ Non-Monotonic Attention-based Read/Write Policy Learning for Simultaneous Translation
Zeeshan Ahmed, Frank Seide, Zhe Liu, Rastislav Rabatin, Jachym Kolar, Niko Moritz, Ruiming Xie, Simone Merello, Christian Fuegen
Simultaneous or streaming machine translation generates translation while
reading the input stream. These systems face a quality/latency trade-off,
aiming to achieve high translation quality similar to non-streaming models with
minimal latency. We propose an approach that efficiently manages this
trade-off. By enhancing a pretrained non-streaming model, which was trained
with a seq2seq mechanism and represents the upper bound in quality, we convert
it into a streaming model by utilizing the alignment between source and target
tokens. This alignment is used to learn a read/write decision boundary for
reliable translation generation with minimal input. During training, the model
learns the decision boundary through a read/write policy module, employing
supervised learning on the alignment points (pseudo labels). The read/write
policy module, a small binary classification unit, can control the
quality/latency trade-off during inference. Experimental results show that our
model outperforms several strong baselines and narrows the gap with the
non-streaming baseline model.
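The inference-time behaviour of a read/write policy can be sketched as a simple loop in which a binary classifier decides whether to READ another source token or WRITE a target token. The policy heuristic and the toy decoder below are placeholders, not the trained modules described above.

```python
# Illustrative read/write loop for simultaneous translation (stubs, not trained modules).
def policy_should_write(source_prefix: list[str], target_prefix: list[str]) -> bool:
    """Stub for the binary read/write classifier (e.g., trained on alignment pseudo labels)."""
    return len(source_prefix) > len(target_prefix)  # naive wait-1-style heuristic

def translate_next(source_prefix: list[str], target_prefix: list[str]) -> str:
    """Stub for the underlying seq2seq decoder producing one target token."""
    return source_prefix[len(target_prefix)].upper()

def simultaneous_translate(source_stream):
    source, target = [], []
    stream = iter(source_stream)
    exhausted = False
    while not exhausted or len(target) < len(source):
        if not exhausted and not policy_should_write(source, target):
            try:
                source.append(next(stream))                    # READ action
            except StopIteration:
                exhausted = True
        else:
            target.append(translate_next(source, target))      # WRITE action
    return target

print(simultaneous_translate(["hallo", "welt", "heute"]))  # ['HALLO', 'WELT', 'HEUTE']
```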
♻ ☆ RAP: Retrieval-Augmented Personalization for Multimodal Large Language Models CVPR 2025
The development of large language models (LLMs) has significantly enhanced
the capabilities of multimodal LLMs (MLLMs) as general assistants. However,
a lack of user-specific knowledge still restricts their application in daily
life. In this paper, we introduce the Retrieval Augmented Personalization
(RAP) framework for MLLMs' personalization. Starting from a general MLLM, we
turn it into a personalized assistant in three steps. (a) Remember: We design a
key-value database to store user-related information, e.g., user's name, avatar
and other attributes. (b) Retrieve: When the user initiates a conversation, RAP
will retrieve relevant information from the database using a multimodal
retriever. (c) Generate: The input query and retrieved concepts' information
are fed into MLLMs to generate personalized, knowledge-augmented responses.
Unlike previous methods, RAP allows real-time concept editing via updating the
external database. To further improve generation quality and alignment with
user-specific information, we design a pipeline for data collection and create
a specialized dataset for personalized training of MLLMs. Based on the dataset,
we train a series of MLLMs as personalized multimodal assistants. By
pretraining on a large-scale dataset, RAP-MLLMs can generalize to infinite visual
concepts without additional finetuning. Our models demonstrate outstanding
flexibility and generation quality across a variety of tasks, such as
personalized image captioning, question answering and visual recognition. The
code, data and models are available at https://hoar012.github.io/RAP-Project/.
comment: Accepted by CVPR 2025. Code: https://github.com/Hoar012/RAP-MLLM
♻ ☆ Unmasking Deceptive Visuals: Benchmarking Multimodal Large Language Models on Misleading Chart Question Answering
Misleading chart visualizations, which intentionally manipulate data
representations to support specific claims, can distort perceptions and lead to
incorrect conclusions. Despite decades of research, misleading visualizations
remain a widespread and pressing issue. Recent advances in multimodal large
language models (MLLMs) have demonstrated strong chart comprehension
capabilities, yet no existing work has systematically evaluated their ability
to detect and interpret misleading charts. This paper introduces the Misleading
Chart Question Answering (Misleading ChartQA) Benchmark, a large-scale
multimodal dataset designed to assess MLLMs in identifying and reasoning about
misleading charts. It contains over 3,000 curated examples, covering 21 types
of misleaders and 10 chart types. Each example includes standardized chart
code, CSV data, and multiple-choice questions with labeled explanations,
validated through multi-round MLLM checks and exhaustive expert human review. We
benchmark 16 state-of-the-art MLLMs on our dataset, revealing their limitations
in identifying visually deceptive practices. We also propose a novel pipeline
that detects and localizes misleaders, enhancing MLLMs' accuracy in misleading
chart interpretation. Our work establishes a foundation for advancing
MLLM-driven misleading chart comprehension. We publicly release the sample
dataset to support further research in this critical area.
comment: 31 pages in total. Under Review For ARR
♻ ☆ Can Language Models Follow Multiple Turns of Entangled Instructions?
Despite significant achievements in improving the instruction-following
capabilities of large language models (LLMs), the ability to process multiple
potentially entangled or conflicting instructions remains a considerable
challenge. Real-world scenarios often require consistency across multiple
instructions over time, such as keeping secrets private, honoring personal
preferences, and respecting prioritization, which demand sophisticated abilities
to integrate multiple
turns and carefully balance competing objectives when instructions intersect or
conflict. This work presents a systematic investigation of LLMs' capabilities
in handling multiple turns of instructions, covering three levels of
difficulty: (1) retrieving information from instructions, (2) tracking and
reasoning across turns, and (3) resolving conflicts among instructions. We
construct MultiTurnInstruct, a dataset of around 1.1K high-quality multi-turn
conversations built through a human-in-the-loop approach and organized into nine
capability categories, including statics and dynamics, reasoning, and
multitasking. Our findings reveal an intriguing trade-off between different
capabilities. While GPT models demonstrate superior memorization, they show
reduced effectiveness in privacy-protection tasks requiring selective
information withholding. Larger models exhibit stronger reasoning capabilities
but still struggle with resolving conflicting instructions. Importantly, these
performance gaps cannot be attributed solely to information loss, as models
demonstrate strong BLEU scores on memorization tasks but their attention
mechanisms fail to integrate multiple related instructions effectively. These
findings highlight critical areas for improvement in complex real-world tasks
involving multi-turn instructions.
comment: 8 pages
♻ ☆ Do LLMs estimate uncertainty well in instruction-following?
Large language models (LLMs) could be valuable personal AI agents across
various domains, provided they can precisely follow user instructions. However,
recent studies have shown significant limitations in LLMs'
instruction-following capabilities, raising concerns about their reliability in
high-stakes applications. Accurately estimating LLMs' uncertainty in adhering
to instructions is critical to mitigating deployment risks. We present, to our
knowledge, the first systematic evaluation of the uncertainty estimation
abilities of LLMs in the context of instruction-following. Our study identifies
key challenges with existing instruction-following benchmarks, where multiple
factors are entangled with the uncertainty stemming from instruction-following,
complicating the isolation and comparison across methods and models. To address
these issues, we introduce a controlled evaluation setup with two benchmark
versions of data, enabling a comprehensive comparison of uncertainty estimation
methods under various conditions. Our findings show that existing uncertainty
methods struggle, particularly when models make subtle errors in instruction
following. While internal model states provide some improvement, they remain
inadequate in more complex scenarios. The insights from our controlled
evaluation setups provide a crucial understanding of LLMs' limitations and
potential for uncertainty estimation in instruction-following tasks, paving the
way for more trustworthy AI agents.
♻ ☆ Output Scouting: Auditing Large Language Models for Catastrophic Responses
Recent high-profile incidents in which the use of Large Language Models
(LLMs) resulted in significant harm to individuals have brought about a growing
interest in AI safety. One reason LLM safety issues occur is that models often
have at least some non-zero probability of producing harmful outputs. In this
work, we explore the following scenario: imagine an AI safety auditor is
searching for catastrophic responses from an LLM (e.g. a "yes" response to
"can I fire an employee for being pregnant?"), and is able to query the model a
limited number of times (e.g. 1000 times). What is a strategy for querying the
model that would efficiently find those failure responses? To this end, we
propose output scouting: an approach that aims to generate semantically fluent
outputs to a given prompt matching any target probability distribution. We then
run experiments using two LLMs and find numerous examples of catastrophic
responses. We conclude with a discussion that includes advice for practitioners
who are looking to implement LLM auditing for catastrophic responses. We also
release an open-source toolkit (https://github.com/joaopfonseca/outputscouting)
that implements our auditing framework using the Hugging Face transformers
library.
comment: Work not ready, further experiments needed to validate the method
♻ ☆ Do LLMs "know" internally when they follow instructions?
Juyeon Heo, Christina Heinze-Deml, Oussama Elachqar, Kwan Ho Ryan Chan, Shirley Ren, Udhay Nallasamy, Andy Miller, Jaya Narain
Instruction-following is crucial for building AI agents with large language
models (LLMs), as these models must adhere strictly to user-provided
constraints and guidelines. However, LLMs often fail to follow even simple and
clear instructions. To improve instruction-following behavior and prevent
undesirable outputs, a deeper understanding of how LLMs' internal states relate
to these outcomes is required. In this work, we investigate whether LLMs encode
information in their representations that correlates with instruction-following
success - a property we term knowing internally. Our analysis identifies a
direction in the input embedding space, termed the instruction-following
dimension, that predicts whether a response will comply with a given
instruction. We find that this dimension generalizes well across unseen tasks
but not across unseen instruction types. We demonstrate that modifying
representations along this dimension improves instruction-following success
rates compared to random changes, without compromising response quality.
Further investigation reveals that this dimension is more closely related to
the phrasing of prompts rather than the inherent difficulty of the task or
instructions. This work provides insight into the internal workings of LLMs'
instruction-following, paving the way for reliable LLM agents.
♻ ☆ SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
M-A-P Team, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, King Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie Zheng, Kaixin Deng, Shawn Gavin, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li, Qinrui Li, Sirun Li, Yizhi Li, Yunwen Li, David Ma, Yuansheng Ni, Haoran Que, Qiyao Wang, Zhoufutu Wen, Siwei Wu, Tyshawn Hsing, Ming Xu, Zhenzhu Yang, Zekun Moore Wang, Junting Zhou, Yuelin Bai, Xingyuan Bu, Chenglin Cai, Liang Chen, Yifan Chen, Chengtuo Cheng, Tianhao Cheng, Keyi Ding, Siming Huang, Yun Huang, Yaoru Li, Yizhe Li, Zhaoqun Li, Tianhao Liang, Chengdong Lin, Hongquan Lin, Yinghao Ma, Tianyang Pang, Zhongyuan Peng, Zifan Peng, Qige Qi, Shi Qiu, Xingwei Qu, Shanghaoran Quan, Yizhou Tan, Zili Wang, Chenqing Wang, Hao Wang, Yiya Wang, Yubo Wang, Jiajun Xu, Kexin Yang, Ruibin Yuan, Yuanhao Yue, Tianyang Zhan, Chun Zhang, Jinyang Zhang, Xiyue Zhang, Xingjian Zhang, Yue Zhang, Yongchi Zhao, Xiangyu Zheng, Chenghua Zhong, Yang Gao, Zhoujun Li, Dayiheng Liu, Qian Liu, Tianyu Liu, Shiwen Ni, Junran Peng, Yujia Qin, Wenbo Su, Guoyin Wang, Shi Wang, Jian Yang, Min Yang, Meng Cao, Xiang Yue, Zhaoxiang Zhang, Wangchunshu Zhou, Jiaheng Liu, Qunshu Lin, Wenhao Huang, Ge Zhang
Large language models (LLMs) have demonstrated remarkable proficiency in
mainstream academic disciplines such as mathematics, physics, and computer
science. However, human knowledge encompasses over 200 specialized disciplines,
far exceeding the scope of existing benchmarks. The capabilities of LLMs in
many of these specialized fields, particularly in light industry, agriculture,
and service-oriented disciplines, remain inadequately evaluated. To address this
gap, we present SuperGPQA, a comprehensive benchmark that evaluates
graduate-level knowledge and reasoning capabilities across 285 disciplines. Our
benchmark employs a novel Human-LLM collaborative filtering mechanism to
eliminate trivial or ambiguous questions through iterative refinement based on
both LLM responses and expert feedback. Our experimental results reveal
significant room for improvement in the performance of current state-of-the-art
LLMs across diverse knowledge domains (e.g., the reasoning-focused model
DeepSeek-R1 achieved the highest accuracy of 61.82% on SuperGPQA), highlighting
the considerable gap between current model capabilities and artificial general
intelligence. Additionally, we present comprehensive insights from our
management of a large-scale annotation process, involving over 80 expert
annotators and an interactive Human-LLM collaborative system, offering valuable
methodological guidance for future research initiatives of comparable scope.
♻ ☆ Function Alignment: A New Theory of Mind and Intelligence, Part I: Foundations
This paper introduces function alignment, a novel theory of mind and
intelligence that is both intuitively compelling and structurally grounded. It
explicitly models how meaning, interpretation, and analogy emerge from
interactions among layered representations, forming a coherent framework
capable not only of modeling minds but also of serving as a blueprint for
building them. One of the key theoretical insights derived from function
alignment is bounded interpretability, which provides a unified explanation for
previously fragmented ideas in cognitive science, such as bounded rationality,
symbol grounding, and analogy-making. Beyond modeling, the function alignment
framework bridges disciplines often kept apart, linking computational
architecture, psychological theory, and even contemplative traditions such as
Zen. Rather than building on any philosophical systems, it offers a structural
foundation upon which multiple ways of understanding the mind may be
reconstructed.
comment: 12 pages, 2 figures. Part I of a multi-part position paper on a new
theory of mind
♻ ☆ Outlier dimensions favor frequent tokens in language models
We study last-layer outlier dimensions, i.e. dimensions that display extreme
activations for the majority of inputs. We show that outlier dimensions arise
in many different modern language models, and trace their function back to the
heuristic of constantly predicting frequent words. We further show how a model
can block this heuristic when it is not contextually appropriate, by assigning
a counterbalancing weight mass to the remaining dimensions, and we investigate
which model parameters boost outlier dimensions and when they arise during
training. We conclude that outlier dimensions are a specialized mechanism
discovered by many distinct models to implement a useful token prediction
heuristic.
comment: 9 pages, 4 figures
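A hedged sketch of how such last-layer outlier dimensions could be located in
practice: compute the mean absolute activation per dimension over many inputs
and flag dimensions far above the rest (the z-score threshold is an
illustrative choice, not the paper's criterion):

    import numpy as np

    def find_outlier_dims(hidden_states, z_threshold=6.0):
        # hidden_states: (num_tokens, hidden_dim) last-layer activations collected
        # over a corpus; outlier dimensions show extreme magnitudes for most inputs.
        mean_abs = np.abs(hidden_states).mean(axis=0)
        z = (mean_abs - mean_abs.mean()) / (mean_abs.std() + 1e-8)
        return np.where(z > z_threshold)[0]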
♻ ☆ Leveraging ASIC AI Chips for Homomorphic Encryption
Jianming Tong, Tianhao Huang, Leo de Castro, Anirudh Itagi, Jingtian Dang, Anupam Golder, Asra Ali, Jevin Jiang, Arvind, G. Edward Suh, Tushar Krishna
Cloud-based services are making the outsourcing of sensitive client data
increasingly common. Although homomorphic encryption (HE) offers strong privacy
guarantees, it requires substantially more resources than computing on
plaintext, often leading to unacceptably large latencies in getting the
results. HE accelerators have emerged to mitigate this latency issue, but at
the high cost of ASICs. In this paper, we show that HE primitives can be
converted to AI operators and accelerated on existing ASIC AI accelerators,
like TPUs, which are already widely deployed in the cloud. Adapting such
accelerators for HE requires (1) supporting modular multiplication, (2)
high-precision arithmetic in software, and (3) efficient mapping on matrix
engines. We introduce the CROSS compiler, which (1) adopts Barrett reduction to
provide modular reduction support using multipliers and adders, (2) uses Basis
Aligned Transformation (BAT) to convert high-precision multiplication into
low-precision matrix-vector multiplication, and (3) uses Matrix Aligned
Transformation (MAT) to convert vectorized modular operations with reduction
into matrix multiplications that can be efficiently processed on a 2D spatial
matrix engine. Our evaluation of CROSS
on a Google TPUv4 demonstrates significant performance improvements, with up to
161x and 5x speedups compared to previous work on many-core CPUs and a V100 GPU.
The kernel-level codes are open-sourced at
https://github.com/google/jaxite/tree/main/jaxite_word.
comment: 16 pages, 11 figures, 4 algorithms, 9 tables. Enabling Google TPUs
for privacy-preserving AI inference
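For reference, the Barrett reduction that CROSS adopts for modular arithmetic
can be sketched in a few lines; this is the textbook integer algorithm, not the
paper's TPU kernel, and the example modulus is an illustrative choice:

    def barrett_setup(n):
        # Precompute an approximation of 1/n scaled by 2^(2k).
        k = n.bit_length()
        m = (1 << (2 * k)) // n
        return k, m

    def barrett_reduce(x, n, k, m):
        # Valid for 0 <= x < n*n, i.e. the product of two residues mod n.
        q = (x * m) >> (2 * k)   # estimate of x // n using only multiplies and shifts
        r = x - q * n            # candidate remainder, off by at most a few n
        while r >= n:            # at most two corrective subtractions
            r -= n
        return r

    n = 65537                    # example modulus
    k, m = barrett_setup(n)
    x = (123456 % n) * (654321 % n)
    assert barrett_reduce(x, n, k, m) == x % n

The appeal for matrix engines is that the reduction uses only multiplications,
shifts, and subtractions, with no division at runtime.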
♻ ☆ Whispering in Amharic: Fine-tuning Whisper for Low-resource Language
Dawit Ketema Gete, Bedru Yimam Ahmed, Tadesse Destaw Belay, Yohannes Ayana Ejigu, Sukairaj Hafiz Imam, Alemu Belay Tessema, Mohammed Oumer Adem, Tadesse Amare Belay, Robert Geislinger, Umma Aliyu Musa, Martin Semmann, Shamsuddeen Hassan Muhammad, Henning Schreiber, Seid Muhie Yimam
This work explores fine-tuning OpenAI's Whisper automatic speech recognition
(ASR) model for Amharic, a low-resource language, to improve transcription
accuracy. While the foundational Whisper model struggles with Amharic due to
limited representation in its training data, we fine-tune it using datasets
like Mozilla Common Voice, FLEURS, and the BDU-speech dataset. The
best-performing model, Whispersmall-am, improves significantly when fine-tuned
on a mix of existing FLEURS data and new, unseen Amharic datasets. Training
solely on new data leads to poor performance, but combining it with FLEURS data
reinforces the model, enabling better specialization in Amharic. We also
demonstrate that normalizing Amharic homophones significantly improves Word
Error Rate (WER) and Bilingual Evaluation Understudy (BLEU) scores. This study
underscores the importance of fine-tuning strategies and dataset composition
for improving ASR in low-resource languages, providing insights for future
Amharic speech recognition research.
♻ ☆ Autonomous AI imitators increase diversity in homogeneous information ecosystems
Recent breakthroughs in large language models (LLMs) have facilitated
autonomous AI agents capable of imitating human-generated content. This
technological advancement raises fundamental questions about AI's impact on the
diversity and democratic value of information ecosystems. We introduce a
large-scale simulation framework to examine AI-based imitation within news, a
context crucial for public discourse. By systematically testing two distinct
imitation strategies across a range of information environments varying in
initial diversity, we demonstrate that AI-generated articles do not uniformly
homogenize content. Instead, AI's influence is strongly context-dependent:
AI-generated content can introduce valuable diversity in originally homogeneous
news environments but diminish diversity in initially heterogeneous contexts.
These results illustrate that the initial diversity of an information
environment critically shapes AI's impact, challenging assumptions that
AI-driven imitation threatens diversity. Instead, when information is initially
homogeneous, AI-driven imitation can expand perspectives, styles, and topics.
This is especially important in news contexts, where information diversity
fosters richer public debate by exposing citizens to alternative viewpoints,
challenging biases, and preventing narrative monopolies, which is essential for
a resilient democracy.
comment: 42 pages, 11 figures, 4 tables; v2: corrected typographical errors,
streamlined language, updated abstract, added supplementary information; v3:
restructured appendix, added temperature and embeddings sensitivity checks
♻ ☆ DeltaProduct: Improving State-Tracking in Linear RNNs via Householder Products ICLR 2025
Linear Recurrent Neural Networks (linear RNNs) have emerged as competitive
alternatives to Transformers for sequence modeling, offering efficient training
and linear-time inference. However, existing architectures face a fundamental
trade-off between expressivity and efficiency, dictated by the structure of
their state-transition matrices. While diagonal matrices used in architectures
like Mamba, GLA, or mLSTM yield fast runtime, they suffer from severely limited
expressivity. To address this, recent architectures such as (Gated) DeltaNet
and RWKV-7 adopted a diagonal plus rank-1 structure, allowing simultaneous
token-channel mixing, which overcomes some expressivity limitations with only a
slight decrease in training efficiency. Building on the interpretation of
DeltaNet's recurrence as performing one step of online gradient descent per
token on an associative recall loss, we introduce DeltaProduct, which instead
takes multiple ($n_h$) steps per token. This naturally leads to diagonal plus
rank-$n_h$ state-transition matrices, formed as products of $n_h$ generalized
Householder transformations, providing a tunable mechanism to balance
expressivity and efficiency and a stable recurrence. Through extensive
experiments, we demonstrate that DeltaProduct achieves superior state-tracking
and language modeling capabilities while exhibiting significantly improved
length extrapolation compared to DeltaNet. Additionally, we strengthen the
theoretical foundation of DeltaNet by proving that it can solve dihedral group
word problems in just two layers.
comment: Accepted at ICLR 2025 Workshop on Foundation Models in the Wild
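To make the diagonal-plus-rank-$n_h$ structure concrete, here is an
illustrative numpy sketch (not the paper's implementation) that forms a
state-transition matrix as a product of $n_h$ generalized Householder
transformations $H_i = I - \beta_i k_i k_i^\top$:

    import numpy as np

    def householder_product(keys, betas):
        # keys: (n_h, d) unit-norm vectors; betas: (n_h,) gates, typically in [0, 2]
        d = keys.shape[1]
        A = np.eye(d)
        for k, beta in zip(keys, betas):
            A = A @ (np.eye(d) - beta * np.outer(k, k))
        return A

    d, n_h = 8, 3
    keys = np.random.randn(n_h, d)
    keys /= np.linalg.norm(keys, axis=1, keepdims=True)
    A = householder_product(keys, np.ones(n_h))
    # The deviation from the identity has rank at most n_h, i.e. the
    # "diagonal plus rank-n_h" structure described above.
    assert np.linalg.matrix_rank(A - np.eye(d)) <= n_h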
♻ ☆ OmniVox: Zero-Shot Emotion Recognition with Omni-LLMs
The use of omni-LLMs (large language models that accept any modality as
input), particularly for multimodal cognitive state tasks involving speech, is
understudied. We present OmniVox, the first systematic evaluation of four
omni-LLMs on the zero-shot emotion recognition task. We evaluate on two widely
used multimodal emotion benchmarks: IEMOCAP and MELD, and find zero-shot
omni-LLMs outperform or are competitive with fine-tuned audio models. Alongside
our audio-only evaluation, we also evaluate omni-LLMs on text-only and combined
text-and-audio inputs. We present acoustic prompting, an audio-specific prompting strategy for
omni-LLMs which focuses on acoustic feature analysis, conversation context
analysis, and step-by-step reasoning. We compare our acoustic prompting to
minimal prompting and full chain-of-thought prompting techniques. We perform a
context window analysis on IEMOCAP and MELD, and find that using context helps,
especially on IEMOCAP. We conclude with an error analysis on the generated
acoustic reasoning outputs from the omni-LLMs.
comment: Submitted to COLM 2025. Preprint
♻ ☆ DomainCQA: Crafting Expert-Level QA from Domain-Specific Charts
Chart Question Answering (CQA) benchmarks are essential for evaluating the
capability of Multimodal Large Language Models (MLLMs) to interpret visual
data. However, current benchmarks focus primarily on the evaluation of
general-purpose CQA but fail to adequately capture domain-specific challenges.
We introduce DomainCQA, a systematic methodology for constructing
domain-specific CQA benchmarks, and demonstrate its effectiveness by developing
AstroChart, a CQA benchmark in the field of astronomy. Our evaluation shows
that chart reasoning and combining chart information with domain knowledge for
deeper analysis and summarization, rather than domain-specific knowledge, pose
the primary challenge for existing MLLMs, highlighting a critical gap in
current benchmarks. By providing a scalable and rigorous framework, DomainCQA
enables more precise assessment and improvement of MLLMs for domain-specific
applications.
comment: 87 pages, 65 figures
♻ ☆ EQ-Negotiator: An Emotion-Reasoning LLM Agent in Credit Dialogues
While large language model (LLM)-based chatbots have been applied for
effective engagement in credit dialogues, their capacity for dynamic emotional
expression remains limited. Current agents primarily rely on passive empathy
rather than affective reasoning. For instance, when faced with persistent
client negativity, the agent should employ strategic emotional adaptation by
expressing measured anger to discourage counterproductive behavior and guide
the conversation toward resolution. This context-aware emotional modulation is
essential for imitating the nuanced decision-making of human negotiators. This
paper introduces an EQ-negotiator that combines emotion sensing from
pre-trained language models (PLMs) with emotional reasoning based on Game
Theory and Hidden Markov Models. It takes into account both the current and
historical emotions of the client to better manage and address negative
emotions during interactions. By fine-tuning PLMs on public emotion datasets
and validating them on credit dialogue datasets,
our approach enables LLM-based agents to effectively capture shifts in client
emotions and dynamically adjust their response tone based on our emotion
decision policies in real-world financial negotiations. This EQ-negotiator can
also help credit agencies foster positive client relationships, enhancing
satisfaction in credit services.
♻ ☆ Evil twins are not that evil: Qualitative insights into machine-generated prompts
It has been widely observed that language models (LMs) respond in predictable
ways to algorithmically generated prompts that are seemingly unintelligible.
This is both a sign that we lack a full understanding of how LMs work, and a
practical challenge, because opaqueness can be exploited for harmful uses of
LMs, such as jailbreaking. We present the first thorough analysis of opaque
machine-generated prompts, or autoprompts, pertaining to 6 LMs of different
sizes and families. We find that machine-generated prompts are characterized by
a last token that is often intelligible and strongly affects the generation. A
small but consistent proportion of the previous tokens are prunable, probably
appearing in the prompt as a by-product of the fact that the optimization
process fixes the number of tokens. The remaining tokens fall into two
categories: filler tokens, which can be replaced with semantically unrelated
substitutes, and keywords, which tend to have at least a loose semantic relation
with the generation, although they do not engage in well-formed syntactic
relations with it. Additionally, human experts can reliably identify the most
influential tokens in an autoprompt a posteriori, suggesting these prompts are
not entirely opaque. Finally, some of the ablations we applied to autoprompts
yield similar effects in natural language inputs, suggesting that autoprompts
emerge naturally from the way LMs process linguistic inputs in general.
♻ ☆ Enhancing Small Language Models for Cross-Lingual Generalized Zero-Shot Classification with Soft Prompt Tuning NAACL 2025
In NLP, Zero-Shot Classification (ZSC) has become essential for enabling
models to classify text into categories unseen during training, particularly in
low-resource languages and domains where labeled data is scarce. While
pretrained language models (PLMs) have shown promise in ZSC, they often rely on
large training datasets or external knowledge, limiting their applicability in
multilingual and low-resource scenarios. Recent approaches leveraging natural
language prompts reduce the dependence on large training datasets but struggle
to effectively incorporate available labeled data from related classification
tasks, especially when these datasets originate from different languages or
distributions. Moreover, existing prompt-based methods typically rely on
manually crafted prompts in a specific language, limiting their adaptability
and effectiveness in cross-lingual settings. To address these challenges, we
introduce RoSPrompt, a lightweight and data-efficient approach for training
soft prompts that enhance cross-lingual ZSC while ensuring robust
generalization across data distribution shifts. RoSPrompt is designed for small
multilingual PLMs, enabling them to leverage high-resource languages to improve
performance in low-resource settings without requiring extensive fine-tuning or
high computational costs. We evaluate our approach on multiple multilingual
PLMs across datasets covering 106 languages, demonstrating strong cross-lingual
transfer performance and robust generalization capabilities over unseen
classes.
comment: Workshop on Language Models for Underserved Communities (co-located
with NAACL 2025)
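As a generic reference point for the soft-prompt mechanism used here, the
sketch below prepends learnable prompt vectors to the token embeddings while
the backbone stays frozen (PyTorch; names and dimensions are illustrative, not
RoSPrompt-specific):

    import torch
    import torch.nn as nn

    class SoftPrompt(nn.Module):
        def __init__(self, prompt_len: int, embed_dim: int):
            super().__init__()
            # Learnable prompt vectors, initialized with small random values.
            self.prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)

        def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
            # token_embeddings: (batch, seq_len, embed_dim) from the frozen PLM
            batch = token_embeddings.size(0)
            prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
            return torch.cat([prompt, token_embeddings], dim=1)

    # Only SoftPrompt.prompt is updated during training; the multilingual PLM's
    # parameters stay frozen, which keeps the method lightweight and data-efficient.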
♻ ☆ VinaBench: Benchmark for Faithful and Consistent Visual Narratives CVPR 2025
Silin Gao, Sheryl Mathew, Li Mi, Sepideh Mamooler, Mengjie Zhao, Hiromi Wakaki, Yuki Mitsufuji, Syrielle Montariol, Antoine Bosselut
Visual narrative generation transforms textual narratives into sequences of
images illustrating the content of the text. However, generating visual
narratives that are faithful to the input text and self-consistent across
generated images remains an open challenge, due to the lack of knowledge
constraints used for planning the stories. In this work, we propose a new
benchmark, VinaBench, to address this challenge. Our benchmark annotates the
underlying commonsense and discourse constraints in visual narrative samples,
offering systematic scaffolds for learning the implicit strategies of visual
storytelling. Based on the incorporated narrative constraints, we further
propose novel metrics to closely evaluate the consistency of generated
narrative images and the alignment of generations with the input textual
narrative. Our results across three generative vision models demonstrate that
learning with VinaBench's knowledge constraints effectively improves the
faithfulness and cohesion of generated visual narratives.
comment: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR 2025)
♻ ☆ Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance
While large language models (LLMs) have shown strong general reasoning
capabilities, their effectiveness in financial reasoning, which is crucial for
real-world financial applications, remains underexplored. In this study, we
conduct a comprehensive evaluation of 24 state-of-the-art general and
reasoning-focused LLMs across four complex financial reasoning tasks involving
financial text, tabular data, and equations. We assess key capabilities such as
numerical reasoning, tabular interpretation, financial terminology
comprehension, long-context understanding, and equation-based problem solving.
Our analysis reveals that while data quality and pretraining contribute to
performance, general techniques like chain-of-thought (CoT) fine-tuning offer
limited gains in financial tasks. To address this, we propose two
domain-adapted models, Fino1-8B and Fino1-14B, trained with CoT fine-tuning and
reinforcement learning using domain-specific reasoning paths. Our models are
trained on a carefully curated dataset integrating high-quality examples from
diverse sources, covering financial reports, tables, equations, and structured
XBRL texts. Despite limited training data, they achieve a 7-9% performance
improvement, outperforming several advanced LLMs, including GPT-o1,
GPT-o3-mini, and GPT-4.5, and performing comparably with DeepSeek models (V3
and R1), demonstrating strong practical value in resource-constrained scenarios. Our
findings highlight the need for domain-specific adaptations in financial
reasoning, and we release all datasets, models, and code for future research.
comment: 13 pages, 2 figures, 3 Tables
♻ ☆ Retrieval Backward Attention without Additional Training: Enhance Embeddings of Large Language Models via Repetition
Language models can be viewed as functions that embed text into Euclidean
space, where the quality of the embedding vectors directly determines model
performance; however, training such neural networks involves various uncertainties. This
paper focuses on improving the performance of pre-trained language models in
zero-shot settings through a simple and easily implementable method. We propose
a novel backward attention mechanism to enhance contextual information
encoding. Evaluated on the Chinese Massive Text Embedding Benchmark (C-MTEB),
our approach achieves significant improvements across multiple tasks, providing
valuable insights for advancing zero-shot learning capabilities.
♻ ☆ ProTrix: Building Models for Planning and Reasoning over Tables with Sentence Context EMNLP 2024
Tables play a crucial role in conveying information in various domains. We
propose a Plan-then-Reason framework to answer different types of user queries
over tables with sentence context. The framework first plans the reasoning
paths over the context, then assigns each step to program-based or textual
reasoning to reach the final answer. This framework enhances the table
reasoning abilities for both in-context learning and fine-tuning methods.
GPT-3.5-Turbo following the Plan-then-Reason framework surpasses other prompting
baselines without self-consistency while using fewer API calls and in-context
demonstrations. We also construct an instruction tuning set TrixInstruct to
evaluate the effectiveness of fine-tuning with this framework. We present
the ProTrix model family by fine-tuning models on TrixInstruct. Our experiments
show that the ProTrix family generalizes to diverse unseen tabular tasks with only 6k
training instances. We further demonstrate that ProTrix can generate accurate
and faithful explanations to answer complex free-form questions. Our work
underscores the importance of planning and reasoning abilities for building
models that tackle tabular tasks with generalizability and interpretability. We
open-source our dataset and models at https://github.com/WilliamZR/ProTrix.
comment: EMNLP 2024 Findings
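A hypothetical sketch of the Plan-then-Reason dispatch: the model first drafts
a plan, then each step is routed to program-based or textual reasoning. The
"[program]" routing tag and the llm / run_program placeholders are assumptions
for illustration, not the paper's actual interface:

    def plan_then_reason(llm, run_program, table, sentences, question):
        # Step 1: plan the reasoning path over the table and sentence context.
        plan = llm(f"Plan the reasoning steps for: {question}\n"
                   f"Table: {table}\nContext: {sentences}")
        results = []
        # Step 2: execute each step with program-based or textual reasoning.
        for step in plan.splitlines():
            if "[program]" in step:                 # planner-emitted routing tag (assumed)
                results.append(run_program(step, table))
            else:
                results.append(llm(f"{step}\nIntermediate results: {results}"))
        return results[-1] if results else ""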
♻ ☆ SAIF: A Comprehensive Framework for Evaluating the Risks of Generative AI in the Public Sector AAAI
The rapid adoption of generative AI in the public sector, encompassing
diverse applications ranging from automated public assistance to welfare
services and immigration processes, highlights its transformative potential
while underscoring the pressing need for thorough risk assessments. Despite its
growing presence, evaluations of risks associated with AI-driven systems in the
public sector remain insufficiently explored. Building upon an established
taxonomy of AI risks derived from diverse government policies and corporate
guidelines, we investigate the critical risks posed by generative AI in the
public sector while extending the scope to account for its multimodal
capabilities. In addition, we propose a Systematic dAta generatIon Framework
for evaluating the risks of generative AI (SAIF). SAIF involves four key
stages: breaking down risks, designing scenarios, applying jailbreak methods,
and exploring prompt types. It ensures the systematic and consistent generation
of prompt data, facilitating a comprehensive evaluation while providing a solid
foundation for mitigating the risks. Furthermore, SAIF is designed to
accommodate emerging jailbreak methods and evolving prompt types, thereby
enabling effective responses to unforeseen risk scenarios. We believe that this
study can play a crucial role in fostering the safe and responsible integration
of generative AI into the public sector.
comment: 6 pages, 2 figures, 1 table. AI for Public Missions (AIPM) Workshop
at the 39th AAAI Conference on Artificial Intelligence (AAAI 2025)
♻ ☆ Sun-Shine: A Large Language Model for Tibetan Culture
Cheng Huang, Fan Gao, Nyima Tashi, Yutong Liu, Xiangxiang Wang, Thupten Tsering, Ban Ma-bao, Renzeg Duojie, Gadeng Luosang, Rinchen Dongrub, Dorje Tashi, Xiao Feng, Yongbin Yu
Tibetan, a minority language in China, features a highly intricate
grammatical structure, characterized by four verb tenses and a tense system
with frequent irregularities, contributing to its extensive inflectional
diversity. Recently, advances in Large Language Models (LLMs) have transformed
the paradigm in many domains. Despite the success in other fields, current LLMs
often fall short in catering to the needs of domain experts like Tibetans, and
the potential of LLMs for Tibetan culture is under-explored. The intrinsic
reasons are the immense and intricate nature of Tibetan culture as well as the
necessity for higher granularity and richness in knowledge. Simultaneously, the
complexity and uniqueness of its grammatical structure, coupled with its status
as a minority ethnic language, contribute to data scarcity, which remains a
fundamental challenge. To alleviate these issues, we introduce Llama-Sunshine
(Sun-Shine), the first large language model for Tibetan culture, which is
expert in various Tibetan language processing tasks. Sun-Shine incorporates
state-of-the-art model architectures optimized for Tibetan's linguistic
features. We also propose TIB-STC, a comprehensive dataset comprising diverse
Tibetan texts such as literature, religious scripts, news, and conversational
data, which is also the first large-scale dataset for Tibetan culture. Through
comprehensive experiments, Sun-Shine not only demonstrates a higher level of
knowledge expertise for Tibetan culture but also gains preliminary embodied
intelligence capabilities in Tibetan language processing tasks, like language
modeling, text classification, machine translation, and syntactic analysis.
Moreover, it excels in low-resource scenarios, showcasing strong generalization
capabilities.
♻ ☆ Frame-Voyager: Learning to Query Frames for Video Large Language Models ICLR 2025
Sicheng Yu, Chengkai Jin, Huanyu Wang, Zhenghao Chen, Sheng Jin, Zhongrong Zuo, Xiaolei Xu, Zhenbang Sun, Bingni Zhang, Jiawei Wu, Hao Zhang, Qianru Sun
Video Large Language Models (Video-LLMs) have made remarkable progress in
video understanding tasks. However, they are constrained by the maximum length
of input tokens, making it impractical to input entire videos. Existing frame
selection approaches, such as uniform frame sampling and text-frame retrieval,
fail to account for the information density variations in the videos or the
complex instructions in the tasks, leading to sub-optimal performance. In this
paper, we propose Frame-Voyager that learns to query informative frame
combinations, based on the given textual queries in the task. To train
Frame-Voyager, we introduce a new data collection and labeling pipeline, by
ranking frame combinations using a pre-trained Video-LLM. Given a video of M
frames, we traverse its T-frame combinations, feed them into a Video-LLM, and
rank them based on Video-LLM's prediction losses. Using this ranking as
supervision, we train Frame-Voyager to query the frame combinations with lower
losses. In experiments, we evaluate Frame-Voyager on four Video Question
Answering benchmarks by plugging it into two different Video-LLMs. The
experimental results demonstrate that Frame-Voyager achieves impressive results
in all settings, highlighting its potential as a plug-and-play solution for
Video-LLMs.
comment: ICLR 2025, Camera-ready Version
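An illustrative sketch of the labeling idea described above: enumerate T-frame
combinations, score each with a frozen Video-LLM's prediction loss, and rank
them (lower loss first). Here video_llm_loss is a placeholder for the actual
model call, and exhaustive enumeration is only tractable for short videos:

    from itertools import combinations

    def rank_frame_combinations(frames, query, answer, video_llm_loss, T=4):
        ranked = []
        for combo in combinations(range(len(frames)), T):
            selected = [frames[i] for i in combo]
            loss = video_llm_loss(selected, query, answer)   # frozen Video-LLM scoring
            ranked.append((loss, combo))
        ranked.sort(key=lambda x: x[0])   # lowest-loss combinations become supervision
        return ranked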
♻ ☆ Measuring the Influence of Incorrect Code on Test Generation
It is natural to suppose that a Large Language Model is more likely to
generate correct test cases when prompted with correct code under test,
compared to incorrect code under test. However, the size of this effect has
never been previously measured, despite its obvious importance for both
practicing software engineers and researchers. To answer the question, we
conducted a comprehensive empirical study on 5 open source and 6 closed source
language models, with 3 widely-used benchmark data sets together with 41
repo-level examples from two different real-world data sets. Our
results reveal that, when compared to incorrect code under test, LLMs prompted
with correct code achieve improvements in test accuracy, code coverage, and bug
detection of 57\%, 12\%, and 24\% respectively. We further show that these
scientific conclusions carry over from the three benchmark data sets to the
real-world code, where tests generated for incorrect code experience a 47\%
worse bug detection rate. Finally, we report that improvements of +18\% in
accuracy, +4\% coverage, and +34\% in bug detection can be achieved by
providing natural language code descriptions. These findings have actionable
conclusions. For example, the 47\% reduction in real-world bug detection is a
clear concern. Fortunately, it is a concern for which our findings about the
added value of descriptions offer an immediately actionable remedy.
comment: Under review
♻ ☆ Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation
Songjun Tu, Jiahao Lin, Xiangyu Tian, Qichao Zhang, Linjing Li, Yuqian Fu, Nan Xu, Wei He, Xiangyuan Lan, Dongmei Jiang, Dongbin Zhao
Recent advancements in post-training methodologies for large language models
(LLMs) have highlighted reinforcement learning (RL) as a critical component for
enhancing reasoning. However, the substantial computational costs associated
with RL-based approaches have led to growing interest in alternative paradigms,
such as Direct Preference Optimization (DPO). In this study, we investigate the
effectiveness of DPO in facilitating self-improvement for LLMs through
iterative preference-based learning. We demonstrate that a single round of DPO
with coarse filtering significantly enhances mathematical reasoning
performance, particularly for strong base models. Furthermore, we design an
iterative enhancement framework for both the generator and the reward model
(RM), enabling their mutual improvement through online interaction across
multiple rounds of DPO. Finally, with simple verifiable rewards, our model
DPO-VP achieves RL-level performance with significantly lower computational
overhead. These findings highlight DPO as a scalable and cost-effective
alternative to RL, offering a practical solution for enhancing LLM reasoning in
resource-constrained situations.
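For reference, the per-pair DPO objective iterated in such pipelines can be
written compactly; the sketch below assumes response-level log-probabilities
already summed over tokens (PyTorch):

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logp, policy_rejected_logp,
                 ref_chosen_logp, ref_rejected_logp, beta=0.1):
        # Implicit rewards: log-ratio of the policy to the frozen reference model.
        chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
        rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
        # Maximize the margin between chosen and rejected responses.
        return -F.logsigmoid(chosen_reward - rejected_reward).mean()

Because the loss needs only preference pairs and a frozen reference model, each
round avoids the rollout and reward-model machinery of full RL.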
♻ ☆ Generalizable Prompt Learning of CLIP: A Brief Overview
Existing vision-language models (VLMs) such as CLIP have showcased an
impressive capability to generalize well across various downstream tasks. These
models leverage the synergy between visual and textual information, enabling
them to understand and reason about the content present in images and text in a
unified manner. This article provides a brief overview of CLIP based on
few-shot prompt learning, including experimental data and technical
characteristics of some methods. The purpose of this review is to provide a
reference for researchers who are just starting out on generalizable prompting
of CLIP through few-shot training for classification across 15 datasets, and to
help researchers working on other downstream tasks integrate ideas from this
field.
♻ ☆ Dynamically Allocated Interval-Based Generative Linguistic Steganography with Roulette Wheel
Existing linguistic steganography schemes often overlook the conditional
probability (CP) of tokens in the candidate pool, allocating the same coding to
all tokens, which results in identical selection likelihoods. This approach
leads to the selection of low-CP tokens, degrading the quality of stegos and
making them more detectable. This paper proposes a scheme based on dynamically
allocated intervals, called DAIRstega. DAIRstega first uses a portion of the
secret bits read so far to build the roulette area. Then, this scheme uses the
idea of the roulette
wheel and takes the CPs of tokens as the main basis for allocating the roulette
area (i.e., the interval length). Thus, tokens with larger CPs are allocated
more area. The secret will have an increased likelihood of selecting a token
with a higher CP. During allocation, we designed some allocation functions and
three constraints to optimize the process. Additionally, DAIRstega supports
prompt-based controllable generation of stegos. Extensive experiments show that
the proposed embedding strategy and DAIRstega perform better than existing
methods and baselines, exhibiting strong perceptual, statistical, and semantic
concealment, as well as anti-steganalysis ability. It can also generate
high-quality, longer stegos, addressing a deficiency of prior work. DAIRstega
also shows potential as a secure watermarking technique, offering insights for
its development.
comment: 4 figures, 15 tables. Accepted for publication in Applied Soft
Computing (accepted versions, not the published versions). Thanks for the
support provided by MindSpore Community
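A simplified, hypothetical illustration of roulette-wheel interval allocation:
each candidate token receives an interval proportional to its conditional
probability, and the next secret bits, read as a binary fraction, select the
token whose interval contains that point (this omits DAIRstega's allocation
functions, constraints, and extraction procedure):

    def select_token(candidates, probs, secret_bits, precision=16):
        # Interpret the next `precision` secret bits as a fraction in [0, 1).
        chunk = secret_bits[:precision].ljust(precision, "0")
        point = int(chunk, 2) / (1 << precision)

        total = sum(probs)
        cumulative = 0.0
        for token, p in zip(candidates, probs):
            cumulative += p / total          # interval length proportional to CP
            if point < cumulative:
                return token
        return candidates[-1]

    # Example: high-CP tokens get longer intervals and are selected more often.
    token = select_token(["cat", "dog", "bird"], [0.7, 0.2, 0.1], secret_bits="1011")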
♻ ☆ Overtrained Language Models Are Harder to Fine-Tune
Jacob Mitchell Springer, Sachin Goyal, Kaiyue Wen, Tanishq Kumar, Xiang Yue, Sadhika Malladi, Graham Neubig, Aditi Raghunathan
Large language models are pre-trained on ever-growing token budgets under the
assumption that better pre-training performance translates to improved
downstream models. In this work, we challenge this assumption and show that
extended pre-training can make models harder to fine-tune, leading to degraded
final performance. We term this phenomenon catastrophic overtraining. For
example, the instruction-tuned OLMo-1B model pre-trained on 3T tokens leads to
over 2% worse performance on multiple standard LLM benchmarks than its 2.3T
token counterpart. Through controlled experiments and theoretical analysis, we
show that catastrophic overtraining arises from a systematic increase in the
broad sensitivity of pre-trained parameters to modifications, including but not
limited to fine-tuning. Our findings call for a critical reassessment of
pre-training design that considers the downstream adaptability of the model.
comment: 72 pages, 65 figures, 6 tables
♻ ☆ Auditing language models for hidden objectives
Samuel Marks, Johannes Treutlein, Trenton Bricken, Jack Lindsey, Jonathan Marcus, Siddharth Mishra-Sharma, Daniel Ziegler, Emmanuel Ameisen, Joshua Batson, Tim Belonax, Samuel R. Bowman, Shan Carter, Brian Chen, Hoagy Cunningham, Carson Denison, Florian Dietz, Satvik Golechha, Akbir Khan, Jan Kirchner, Jan Leike, Austin Meek, Kei Nishimura-Gasparian, Euan Ong, Christopher Olah, Adam Pearce, Fabien Roger, Jeanne Salle, Andy Shih, Meg Tong, Drake Thomas, Kelley Rivoire, Adam Jermyn, Monte MacDiarmid, Tom Henighan, Evan Hubinger
We study the feasibility of conducting alignment audits: investigations into
whether models have undesired objectives. As a testbed, we train a language
model with a hidden objective. Our training pipeline first teaches the model
about exploitable errors in RLHF reward models (RMs), then trains the model to
exploit some of these errors. We verify via out-of-distribution evaluations
that the model generalizes to exhibit whatever behaviors it believes RMs rate
highly, including ones not reinforced during training. We leverage this model
to study alignment audits in two ways. First, we conduct a blind auditing game
where four teams, unaware of the model's hidden objective or training,
investigate it for concerning behaviors and their causes. Three teams
successfully uncovered the model's hidden objective using techniques including
interpretability with sparse autoencoders (SAEs), behavioral attacks, and
training data analysis. Second, we conduct an unblinded follow-up study of
eight techniques for auditing the model, analyzing their strengths and
limitations. Overall, our work provides a concrete example of using alignment
audits to discover a model's hidden objective and proposes a methodology for
practicing and validating progress in alignment auditing.
♻ ☆ Multi-modal Speech Transformer Decoders: When Do Multiple Modalities Improve Accuracy?
Decoder-only discrete-token language models have recently achieved
significant success in automatic speech recognition. However, systematic
analyses of how different modalities impact performance in specific scenarios
remain limited. In this paper, we investigate the effects of multiple
modalities on recognition accuracy on both synthetic and real-world datasets.
Our experiments suggest that: (1) Integrating more modalities can increase
accuracy; in particular, our paper is, to the best of our knowledge, the first to show
the benefit of combining audio, image context, and lip information; (2) Images
as a supplementary modality for speech recognition provide the greatest benefit
at moderate noise levels; moreover, they exhibit a different trend compared to
inherently synchronized modalities like lip movements; (3) Performance improves
on both synthetic and real-world datasets when the most relevant visual
information is filtered as a preprocessing step.
♻ ☆ Foot-In-The-Door: A Multi-turn Jailbreak for LLMs
Ensuring AI safety is crucial as large language models become increasingly
integrated into real-world applications. A key challenge is jailbreak, where
adversarial prompts bypass built-in safeguards to elicit harmful, disallowed
outputs. Inspired by psychological foot-in-the-door principles, we introduce
FITD, a novel multi-turn jailbreak method that leverages the phenomenon where
minor initial commitments lower resistance to more significant or more
unethical transgressions. Our approach progressively escalates the malicious
intent of user queries through intermediate bridge prompts and uses the model's
own responses to align it toward producing toxic outputs. Extensive experimental
results on two jailbreak benchmarks demonstrate that FITD achieves an average
attack success rate of 94% across seven widely used models, outperforming
existing state-of-the-art methods. Additionally, we provide an in-depth
analysis of LLM self-corruption, highlighting vulnerabilities in current
alignment strategies and emphasizing the risks inherent in multi-turn
interactions. The code is available at
https://github.com/Jinxiaolong1129/Foot-in-the-door-Jailbreak.
comment: 19 pages, 8 figures
♻ ☆ Self-Rewarding Language Models ICML 2024
Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, Jason Weston
We posit that to achieve superhuman agents, future models require superhuman
feedback in order to provide an adequate training signal. Current approaches
commonly train reward models from human preferences, which may then be
bottlenecked by the human performance level; moreover, these separate frozen
reward models cannot then learn to improve during LLM training. In this work,
we study Self-Rewarding Language Models, where the language model itself is
used via LLM-as-a-Judge prompting to provide its own rewards during training.
We show that during Iterative DPO training, not only does instruction-following
ability improve, but so does the model's ability to provide high-quality rewards
to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a
model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard,
including Claude 2, Gemini Pro, and GPT-4 0613. While there is much left still
to explore, this work opens the door to the possibility of models that can
continually improve in both axes.
comment: ICML 2024
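A minimal sketch of one self-rewarding data-construction round, with generate
and judge_score standing in for calls to the same LLM (the latter via an
LLM-as-a-Judge prompt); the sampling count and best-vs-worst pairing are
illustrative assumptions, not necessarily the paper's exact recipe:

    def build_preference_pairs(prompts, generate, judge_score, n_samples=4):
        pairs = []
        for prompt in prompts:
            responses = [generate(prompt) for _ in range(n_samples)]
            scores = [judge_score(prompt, r) for r in responses]   # model judges itself
            best = max(range(n_samples), key=lambda i: scores[i])
            worst = min(range(n_samples), key=lambda i: scores[i])
            if scores[best] > scores[worst]:
                pairs.append({"prompt": prompt,
                              "chosen": responses[best],
                              "rejected": responses[worst]})
        return pairs

    # The resulting pairs feed the next round of DPO training; the updated model
    # then generates and judges data for the following iteration.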