Computation and Language 57
☆ WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers
Pretrained automatic speech recognition (ASR) models such as Whisper perform
well but still need domain adaptation to handle unseen vocabulary and parlance.
In many real-world settings, collecting speech data is impractical,
necessitating text-only adaptation. We propose WhisTLE, a deeply supervised,
text-only adaptation method for pretrained encoder-decoder ASR models. WhisTLE
trains a variational autoencoder (VAE) to model encoder outputs from text and
fine-tunes the decoder using the learned text-to-latent encoder, optionally
combined with text-to-speech (TTS) adaptation. At inference, the original
encoder is restored, incurring no extra runtime cost. Across four out-of-domain
datasets and four ASR models, WhisTLE with TTS reduces word error rate (WER) by
12.3% relative to TTS-only adaptation and outperforms all non-WhisTLE baselines
in 27 of 32 scenarios.
comment: 5 pages, 2 figures
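A minimal sketch of the text-to-latent idea described in this abstract: a toy VAE maps text tokens to pseudo encoder states so the decoder could later be fine-tuned on text alone. All module names, shapes, and the training step are illustrative assumptions, not the authors' code, and the sketch ignores how text length is matched to encoder frame length.
```python
import torch
import torch.nn as nn

class TextToLatentVAE(nn.Module):
    """Toy text-to-latent module standing in for WhisTLE's VAE (assumed design)."""
    def __init__(self, vocab_size=1000, d_model=256, enc_dim=384):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.backbone = nn.GRU(d_model, d_model, batch_first=True)
        self.to_mu = nn.Linear(d_model, enc_dim)
        self.to_logvar = nn.Linear(d_model, enc_dim)

    def forward(self, token_ids):
        h, _ = self.backbone(self.embed(token_ids))        # (B, T, d_model)
        mu, logvar = self.to_mu(h), self.to_logvar(h)      # per-position latents
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return z, mu, logvar

# Toy training step: match real ASR encoder states (random placeholders here)
# with an MSE reconstruction term plus the usual KL regularizer.
vae = TextToLatentVAE()
opt = torch.optim.Adam(vae.parameters(), lr=1e-4)
tokens = torch.randint(0, 1000, (4, 32))
real_encoder_states = torch.randn(4, 32, 384)              # stand-in for encoder outputs
z, mu, logvar = vae(tokens)
recon = nn.functional.mse_loss(z, real_encoder_states)
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
(recon + 1e-3 * kl).backward()
opt.step()
```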
☆ DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL
Rui Lu, Zhenyu Hou, Zihan Wang, Hanchen Zhang, Xiao Liu, Yujiang Li, Shi Feng, Jie Tang, Yuxiao Dong
Augmenting large language models (LLMs) with browsing tools substantially
improves their potential as deep search agents to solve complex, real-world
tasks. Yet, open LLMs still perform poorly in such settings due to limited
long-horizon reasoning capacity with browsing tools and the lack of
sufficiently difficult supervised data. To address these challenges, we present
DeepDive to advance deep search agents. First, we propose a strategy to
automatically synthesize complex, difficult, and hard-to-find questions from
open knowledge graphs. Second, we apply end-to-end multi-turn reinforcement
learning (RL) to enhance LLMs' long-horizon reasoning with deep search.
Experiments show that DeepDive-32B achieves a new open-source competitive
result on BrowseComp, outperforming WebSailor, DeepSeek-R1-Browse, and
Search-o1. We demonstrate that multi-turn RL training improves deep search
ability and significantly contributes to the performance improvements across
multiple benchmarks. We observe that DeepDive enables test-time scaling of tool
calls and parallel sampling. All datasets, models, and code are publicly
available at https://github.com/THUDM/DeepDive.
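To make the question-synthesis idea concrete, here is a hedged toy of composing a multi-hop question by walking relations in a knowledge graph and withholding the final entity as the answer. The miniature graph, the walk policy, and the question template are assumptions for illustration only, not the DeepDive pipeline.
```python
import random

# toy knowledge graph: (head, relation) -> tail
kg = {
    ("Marie Curie", "born_in"): "Warsaw",
    ("Warsaw", "capital_of"): "Poland",
    ("Poland", "joined_eu_in"): "2004",
}

def sample_path(kg, start, hops=3):
    path, entity = [], start
    for _ in range(hops):
        edges = [(h, r, t) for (h, r), t in kg.items() if h == entity]
        if not edges:
            break
        h, r, t = random.choice(edges)
        path.append((h, r, t))
        entity = t
    return path

def path_to_question(path):
    # Compose a nested, compositional question whose answer is the final tail entity.
    question = path[0][0]
    for h, r, t in path:
        question = f"the {r.replace('_', ' ')} of {question}"
    return f"What is {question}?", path[-1][2]

random.seed(0)
q, a = path_to_question(sample_path(kg, "Marie Curie"))
print(q, "->", a)
```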
☆ RefactorCoderQA: Benchmarking LLMs for Multi-Domain Coding Question Solutions in Cloud and Edge Deployment
To optimize the reasoning and problem-solving capabilities of Large Language
Models (LLMs), we propose a novel cloud-edge collaborative architecture that
enables a structured, multi-agent prompting framework. This framework comprises
three specialized components: GuideLLM, a lightweight model deployed at the
edge to provide methodological guidance; SolverLLM, a more powerful model
hosted in the cloud responsible for generating code solutions; and JudgeLLM, an
automated evaluator for assessing solution correctness and quality. To evaluate
and demonstrate the effectiveness of this architecture in realistic settings,
we introduce RefactorCoderQA, a comprehensive benchmark designed to evaluate
and enhance the performance of LLMs across multi-domain
coding tasks. Motivated by the limitations of existing benchmarks,
RefactorCoderQA systematically covers various technical domains, including
Software Engineering, Data Science, Machine Learning, and Natural Language
Processing, using authentic coding challenges from Stack Overflow. Extensive
experiments reveal that our fine-tuned model, RefactorCoder-MoE, achieves
state-of-the-art performance, significantly outperforming leading open-source
and commercial baselines with an overall accuracy of 76.84%. Human evaluations
further validate the interpretability, accuracy, and practical relevance of the
generated solutions. In addition, we evaluate system-level metrics, such as
throughput and latency, to gain deeper insights into the performance
characteristics and trade-offs of the proposed architecture.
comment: 12 pages, 5 figures, submitted to IEEE Transactions on Services
Computing
☆ Long Context Automated Essay Scoring with Language Models
Transformer-based language models are architecturally constrained to process
text of a fixed maximum length. Essays written by higher-grade students
frequently exceed the maximum allowed length for many popular open-source
models. A common approach to addressing this issue when using these models for
Automated Essay Scoring is to truncate the input text. This raises serious
validity concerns as it undermines the model's ability to fully capture and
evaluate organizational elements of the scoring rubric, which require long
contexts to assess. In this study, we evaluate several models that incorporate
architectural modifications of the standard transformer architecture to
overcome these length limitations using the Kaggle ASAP 2.0 dataset. The models
considered in this study include fine-tuned versions of XLNet, Longformer,
ModernBERT, Mamba, and Llama models.
comment: 8 pages, 2 figures, 2 tables
☆ Is In-Context Learning Learning?
In-context learning (ICL) allows some autoregressive models to solve tasks
via next-token prediction and without needing further training. This has led to
claims about these models' ability to solve (learn) unseen tasks with only a
few shots (exemplars) in the prompt. However, deduction does not always imply
learning, as ICL does not explicitly encode a given observation. Instead, the
models rely on their prior knowledge and the exemplars given, if any. We argue
that, mathematically, ICL does constitute learning, but its full
characterisation requires empirical work. We then carry out a large-scale
analysis of ICL ablating out or accounting for memorisation, pretraining,
distributional shifts, and prompting style and phrasing. We find that ICL is an
effective learning paradigm, but limited in its ability to learn and generalise
to unseen tasks. We note that, in the limit where exemplars become more
numerous, accuracy is insensitive to exemplar distribution, model, prompt
style, and the input's linguistic features. Instead, the model deduces patterns from
regularities in the prompt, which leads to distributional sensitivity,
especially in prompting styles such as chain-of-thought. Given the varied
accuracies on formally similar tasks, we conclude that autoregression's ad-hoc
encoding is not a robust mechanism, and suggests limited all-purpose
generalisability.
comment: Director's cut
☆ Abduct, Act, Predict: Scaffolding Causal Inference for Automated Failure Attribution in Multi-Agent Systems
Failure attribution in multi-agent systems -- pinpointing the exact step
where a decisive error occurs -- is a critical yet unsolved challenge. Current
methods treat this as a pattern recognition task over long conversation logs,
leading to critically low step-level accuracy (below 17\%), which renders them
impractical for debugging complex systems. Their core weakness is a fundamental
inability to perform robust counterfactual reasoning: to determine if
correcting a single action would have actually averted the task failure. To
bridge this counterfactual inference gap, we introduce Abduct-Act-Predict (A2P)
Scaffolding, a novel agent framework that transforms failure attribution from
pattern recognition into a structured causal inference task. A2P explicitly
guides a large language model through a formal three-step reasoning process
within a single inference pass: (1) Abduction, to infer the hidden root causes
behind an agent's actions; (2) Action, to define a minimal corrective
intervention; and (3) Prediction, to simulate the subsequent trajectory and
verify if the intervention resolves the failure. This structured approach
leverages the holistic context of the entire conversation while imposing a
rigorous causal logic on the model's analysis. Our extensive experiments on the
Who\&When benchmark demonstrate its efficacy. On the Algorithm-Generated
dataset, A2P achieves 47.46\% step-level accuracy, a 2.85$\times$ improvement
over the 16.67\% of the baseline. On the more complex Hand-Crafted dataset, it
achieves 29.31\% step accuracy, a 2.43$\times$ improvement over the baseline's
12.07\%. By reframing the problem through a causal lens, A2P Scaffolding
provides a robust, verifiable, and significantly more accurate solution for
automated failure attribution.
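As a hedged illustration of the scaffold described above, the snippet below packs the three reasoning steps into one prompt for a single inference pass. The exact wording of the instructions is an assumption, not the authors' template.
```python
def build_a2p_prompt(conversation_log: str, suspect_step: int) -> str:
    """Assemble an Abduct-Act-Predict prompt (illustrative phrasing, assumed)."""
    return (
        "You are analyzing a failed multi-agent run.\n"
        f"Conversation log:\n{conversation_log}\n\n"
        f"Focus on step {suspect_step} and answer in three parts:\n"
        "1. ABDUCTION: infer the hidden root cause behind the agent's action at this step.\n"
        "2. ACTION: define the minimal corrective intervention to that single action.\n"
        "3. PREDICTION: simulate the remaining trajectory with the intervention applied "
        "and state whether the task failure would have been averted.\n"
        "Finish with 'DECISIVE: yes' or 'DECISIVE: no'."
    )

print(build_a2p_prompt("[step 1] planner: ...\n[step 2] coder: ...", suspect_step=2))
```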
☆ Dropping Experts, Recombining Neurons: Retraining-Free Pruning for Sparse Mixture-of-Experts LLMs EMNLP2025
Sparse Mixture-of-Experts (SMoE) architectures are widely used in large
language models (LLMs) due to their computational efficiency. However, though
only a few experts are activated for each token, SMoE still requires loading
all expert parameters, leading to high memory usage and challenges in
deployment. Previous work has tried to reduce the overhead by pruning and
merging experts, but primarily focused on expert-level operations, leaving
neuron-level structure underexplored. We propose DERN (Dropping Experts,
Recombining Neurons), a task-agnostic and retraining-free framework for expert
pruning and reconstruction. We observe that experts are often misaligned and
contain semantic conflicts at the neuron level, which poses challenges for
direct merging. To solve this, DERN works in three steps: it first prunes
redundant experts using router statistics; then it decomposes them into
neuron-level expert segments, assigning each segment to its most compatible
retained expert; and finally, it merges segments within each retained expert to
build a compact representation. Experiments on Mixtral, Qwen, and DeepSeek SMoE
models show that DERN improves performance by more than 5% on commonsense
reasoning and MMLU benchmarks under 50% expert sparsity, without extra
training. It also greatly reduces the number of experts and memory usage,
making SMoE LLMs easier to deploy in practice.
comment: Accepted to EMNLP2025
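A toy sketch of the three DERN steps as summarized in the abstract: drop experts with low router mass, treat each row of a dropped expert's weight matrix as a neuron-level segment, and merge every segment into the most similar retained expert. The shapes, the cosine similarity measure, and the running-average merge rule are assumptions for illustration only.
```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_ff, d_model = 8, 16, 32
experts = [rng.normal(size=(d_ff, d_model)) for _ in range(n_experts)]  # W_in per expert
router_load = rng.dirichlet(np.ones(n_experts))                         # token mass per expert

keep = np.argsort(router_load)[n_experts // 2:]                         # 50% expert sparsity
drop = np.argsort(router_load)[: n_experts // 2]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

merged = {k: experts[k].copy() for k in keep}
counts = {k: np.ones(d_ff) for k in keep}
for e in drop:
    for neuron in experts[e]:                                            # one neuron = one row
        # assign the segment to the retained expert holding its most similar neuron
        best_k, best_row, best_sim = None, None, -1.0
        for k in keep:
            sims = np.array([cosine(neuron, row) for row in merged[k]])
            if sims.max() > best_sim:
                best_k, best_row, best_sim = k, int(sims.argmax()), sims.max()
        # merge by running average into the matched neuron of the retained expert
        c = counts[best_k][best_row]
        merged[best_k][best_row] = (merged[best_k][best_row] * c + neuron) / (c + 1)
        counts[best_k][best_row] = c + 1

print({k: merged[k].shape for k in merged})
```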
☆ SI-FACT: Mitigating Knowledge Conflict via Self-Improving Faithfulness-Aware Contrastive Tuning
Large Language Models often generate unfaithful responses in knowledge-intensive
tasks due to knowledge conflict, that is, a preference for relying on internal
parametric knowledge rather than the provided context. To address this issue, we
propose a novel self-improving framework, Self-Improving Faithfulness-Aware
Contrastive Tuning (SI-FACT). The framework uses a self-instruct mechanism that
allows the base LLM to automatically generate high-quality, structured
contrastive learning data, including anchor samples, semantically equivalent
positive samples, and negative samples simulating unfaithful scenarios. This
approach significantly reduces the cost of manual annotation. Subsequently,
contrastive learning is applied to train the model, enabling it to pull faithful
responses closer and push unfaithful responses farther apart in the
representation space. Experiments on the knowledge conflict evaluation
benchmarks ECARE KRE and COSE KRE show that the SI-FACT model based on Llama3 8B
Instruct improves the Contextual Recall Rate by 6.2% over the best baseline
method, while significantly reducing dependence on internal memory. The results
indicate that SI-FACT provides strong effectiveness and high data efficiency in
enhancing the contextual faithfulness of LLMs, offering a practical pathway
toward building more proactive and trustworthy language models.
☆ Beyond Token Limits: Assessing Language Model Performance on Long Text Classification
Miklós Sebők, Viktor Kovács, Martin Bánóczy, Daniel Møller Eriksen, Nathalie Neptune, Philippe Roussille
The most widely used large language models in the social sciences (such as
BERT and its derivatives, e.g., RoBERTa) have a limitation on the input text
length that they can process to produce predictions. This is a particularly
pressing issue for some classification tasks, where the aim is to handle long
input texts. One such area deals with laws and draft laws (bills), which can
have a length of multiple hundred pages and, therefore, are not particularly
amenable to processing with models that can only handle, e.g., 512 tokens. In
this paper, we show results from experiments covering 5 languages with
XLM-RoBERTa, Longformer, GPT-3.5, GPT-4 models for the multiclass
classification task of the Comparative Agendas Project, which has a codebook of
21 policy topic labels from education to health care. Results show no
particular advantage for the Longformer model, pre-trained specifically for the
purposes of handling long inputs. The comparison between the GPT variants and
the best-performing open model yielded an edge for the latter. An analysis of
class-level factors points to the importance of support and substance overlaps
between specific categories when it comes to performance on long text inputs.
☆ Incongruent Positivity: When Miscalibrated Positivity Undermines Online Supportive Conversations
In emotionally supportive conversations, well-intended positivity can
sometimes misfire, leading to responses that feel dismissive, minimizing, or
unrealistically optimistic. We examine this phenomenon of incongruent
positivity as miscalibrated expressions of positive support in both human and
LLM-generated responses. To this end, we collected real user-assistant
dialogues from Reddit across a range of emotional intensities and generated
additional responses using large language models for the same context. We
categorize these conversations by intensity into two levels: Mild, which covers
relationship tension and general advice, and Severe, which covers grief and
anxiety conversations. This level of categorization enables a comparative
analysis of how supportive responses vary across lower and higher stakes
contexts. Our analysis reveals that LLMs are more prone to unrealistic
positivity through dismissive and minimizing tone, particularly in high-stakes
contexts. To further study the underlying dimensions of this phenomenon, we
fine-tune LLMs on datasets with strong and weak emotional reactions. Moreover,
we developed a weakly supervised multilabel classifier ensemble (DeBERTa and
MentalBERT) that shows improved detection of incongruent positivity types
across two sorts of concerns (Mild and Severe). Our findings shed light on the
need to move beyond merely generating generic positive responses and instead
study congruent support measures that balance positive affect with emotional
acknowledgment. This approach offers insights into aligning large language
models with affective expectations in online supportive dialogue, paving
the way toward context-aware and trust-preserving online conversation systems.
comment: This paper is under review
☆ Benchmark of stylistic variation in LLM-generated texts
This study investigates the register variation in texts written by humans and
comparable texts produced by large language models (LLMs). Biber's
multidimensional analysis (MDA) is applied to a sample of human-written texts
and AI-created texts generated to be their counterparts to find the dimensions
of variation in which LLMs differ most significantly and most systematically
from humans. As textual material, a new LLM-generated corpus AI-Brown is used,
which is comparable to BE-21 (a Brown family corpus representing contemporary
British English). Since all languages except English are underrepresented in
the training data of frontier LLMs, a similar analysis is replicated on Czech
using the AI-Koditex corpus and a Czech multidimensional model. We examined 16
frontier models under various settings and prompts, with emphasis placed on the
difference between base models and instruction-tuned models. Based on this, a
benchmark is created through which models can be compared with each other and
ranked in interpretable dimensions.
☆ Error Analysis in a Modular Meeting Transcription System
Peter Vieting, Simon Berger, Thilo von Neumann, Christoph Boeddeker, Ralf Schlüter, Reinhold Haeb-Umbach
Meeting transcription is a field of high relevance and remarkable progress in
recent years. Still, challenges remain that limit its performance. In this
work, we extend a previously proposed framework for analyzing leakage in speech
separation with proper sensitivity to temporal locality. We show that there is
significant leakage to the cross channel in areas where only the primary
speaker is active. At the same time, the results demonstrate that this does not
affect the final performance much as these leaked parts are largely ignored by
the voice activity detection (VAD). Furthermore, different segmentations are
compared showing that advanced diarization approaches are able to reduce the
gap to oracle segmentation by a third compared to a simple energy-based VAD. We
additionally reveal what factors contribute to the remaining difference. The
results represent state-of-the-art performance on LibriCSS among systems that
train the recognition module on LibriSpeech data only.
comment: Accepted at ITG Conference on Speech Communication 2025
☆ Towards Reliable and Interpretable Document Question Answering via VLMs
Vision-Language Models (VLMs) have shown strong capabilities in document
understanding, particularly in identifying and extracting textual information
from complex documents. Despite this, accurately localizing answers within
documents remains a major challenge, limiting both interpretability and
real-world applicability. To address this, we introduce
\textit{DocExplainerV0}, a plug-and-play bounding-box prediction module that
decouples answer generation from spatial localization. This design makes it
applicable to existing VLMs, including proprietary systems where fine-tuning is
not feasible. Through systematic evaluation, we provide quantitative insights
into the gap between textual accuracy and spatial grounding, showing that
correct answers often lack reliable localization. Our standardized framework
highlights these shortcomings and establishes a benchmark for future research
toward more interpretable and robust document information extraction VLMs.
☆ Population-Aligned Persona Generation for LLM-based Social Simulation
Zhengyu Hu, Zheyuan Xiao, Max Xiong, Yuxuan Lei, Tianfu Wang, Jianxun Lian, Kaize Ding, Ziang Xiao, Nicholas Jing Yuan, Xing Xie
Recent advances in large language models (LLMs) have enabled human-like
social simulations at unprecedented scale and fidelity, offering new
opportunities for computational social science. A key challenge, however, is
the construction of persona sets that authentically represent the diversity and
distribution of real-world populations. Most existing LLM-based social
simulation studies focus primarily on designing agentic frameworks and
simulation environments, often overlooking the complexities of persona
generation and the potential biases introduced by unrepresentative persona
sets. In this paper, we propose a systematic framework for synthesizing
high-quality, population-aligned persona sets for LLM-driven social simulation.
Our approach begins by leveraging LLMs to generate narrative personas from
long-term social media data, followed by rigorous quality assessment to filter
out low-fidelity profiles. We then apply importance sampling to achieve global
alignment with reference psychometric distributions, such as the Big Five
personality traits. To address the needs of specific simulation contexts, we
further introduce a task-specific module that adapts the globally aligned
persona set to targeted subpopulations. Extensive experiments demonstrate that
our method significantly reduces population-level bias and enables accurate,
flexible social simulation for a wide range of research and policy
applications.
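A hedged sketch of the global alignment step described above: reweight generated personas so their trait scores match a reference distribution, then resample with the importance weights. The Gaussian density models, the Big Five stand-in scores, and the sample sizes are assumptions for illustration.
```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
# stand-in: Big Five scores estimated for each generated persona (proposal q)
persona_scores = rng.normal(loc=0.3, scale=1.2, size=(n, 5))
# reference psychometric distribution (target p), e.g. standardized norms
target_mean, target_std = np.zeros(5), np.ones(5)

def log_gauss(x, mu, sd):
    return -0.5 * (((x - mu) / sd) ** 2 + np.log(2 * np.pi * sd ** 2)).sum(axis=1)

proposal_mean = persona_scores.mean(axis=0)
proposal_std = persona_scores.std(axis=0)
log_w = log_gauss(persona_scores, target_mean, target_std) \
      - log_gauss(persona_scores, proposal_mean, proposal_std)
w = np.exp(log_w - log_w.max())
w /= w.sum()

# population-aligned persona set drawn with importance weights
aligned_idx = rng.choice(n, size=2000, replace=True, p=w)
print(persona_scores[aligned_idx].mean(axis=0))   # approaches target_mean
```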
☆ Prominence-aware automatic speech recognition for conversational speech
This paper investigates prominence-aware automatic speech recognition (ASR)
by combining prominence detection and speech recognition for conversational
Austrian German. First, prominence detectors were developed by fine-tuning
wav2vec2 models to classify word-level prominence. The detector was then used
to automatically annotate prosodic prominence in a large corpus. Based on those
annotations, we trained novel prominence-aware ASR systems that simultaneously
transcribe words and their prominence levels. The integration of prominence
information did not change performance compared to our baseline ASR system,
while reaching a prominence detection accuracy of 85.53% for utterances where
the recognized word sequence was correct. This paper shows that
transformer-based models can effectively encode prosodic information and
represents a novel contribution to prosody-enhanced ASR, with potential
applications for linguistic research and prosody-informed dialogue systems.
☆ Scaling Arabic Medical Chatbots Using Synthetic Data: Enhancing Generative AI with Synthetic Patient Records CCS
The development of medical chatbots in Arabic is significantly constrained by
the scarcity of large-scale, high-quality annotated datasets. While prior
efforts compiled a dataset of 20,000 Arabic patient-doctor interactions from
social media to fine-tune large language models (LLMs), model scalability and
generalization remained limited. In this study, we propose a scalable synthetic
data augmentation strategy to expand the training corpus to 100,000 records.
Using advanced generative AI systems (ChatGPT-4o and Gemini 2.5 Pro), we generated
80,000 contextually relevant and medically coherent synthetic question-answer
pairs grounded in the structure of the original dataset. These synthetic
samples were semantically filtered, manually validated, and integrated into the
training pipeline. We fine-tuned five LLMs, including Mistral-7B and AraGPT2,
and evaluated their performance using BERTScore metrics and expert-driven
qualitative assessments. To further analyze the effectiveness of synthetic
sources, we conducted an ablation study comparing ChatGPT-4o and
Gemini-generated data independently. The results showed that ChatGPT-4o data
consistently led to higher F1-scores and fewer hallucinations across all
models. Overall, our findings demonstrate the viability of synthetic
augmentation as a practical solution for enhancing domain-specific language
models in low-resource medical NLP, paving the way for more inclusive,
scalable, and accurate Arabic healthcare chatbot systems.
comment: Accepted in AICCSA 2025
☆ VARCO-VISION-2.0 Technical Report
We introduce VARCO-VISION-2.0, an open-weight bilingual vision-language model
(VLM) for Korean and English with improved capabilities compared to the
previous model VARCO-VISION-14B. The model supports multi-image understanding
for complex inputs such as documents, charts, and tables, and delivers
layout-aware OCR by predicting both textual content and its spatial location.
Trained with a four-stage curriculum with memory-efficient techniques, the
model achieves enhanced multimodal alignment, while preserving core language
abilities and improving safety via preference optimization. Extensive benchmark
evaluations demonstrate strong spatial grounding and competitive results for
both languages, with the 14B model achieving 8th place on the OpenCompass VLM
leaderboard among models of comparable scale. Alongside the 14B-scale model, we
release a 1.7B version optimized for on-device deployment. We believe these
models advance the development of bilingual VLMs and their practical
applications. Two variants of VARCO-VISION-2.0 are available at Hugging Face: a
full-scale 14B model and a lightweight 1.7B model.
comment: 19 pages, 1 figure, 14 tables. Technical report for VARCO-VISION-2.0,
a Korean-English bilingual VLM in 14B and 1.7B variants. Key features:
multi-image understanding, OCR with text localization, improved Korean
capabilities
☆ Arabic Large Language Models for Medical Text Generation
Efficient hospital management systems (HMS) are critical worldwide to address
challenges such as overcrowding, limited resources, and poor availability of
urgent health care. Existing methods often lack the ability to provide
accurate, real-time medical advice, particularly for irregular inputs and
underrepresented languages. To overcome these limitations, this study proposes
an approach that fine-tunes large language models (LLMs) for Arabic medical
text generation. The system is designed to assist patients by providing
accurate medical advice, diagnoses, drug recommendations, and treatment plans
based on user input. The research methodology required the collection of a
unique dataset from social media platforms, capturing real-world medical
conversations between patients and doctors. The dataset, which includes patient
complaints together with medical advice, was properly cleaned and preprocessed
to account for multiple Arabic dialects. Fine-tuning state-of-the-art
generative models, such as Mistral-7B-Instruct-v0.2, LLaMA-2-7B, and GPT-2
Medium, optimized the system's ability to generate reliable medical text.
Results from evaluations indicate that the fine-tuned Mistral-7B model
outperformed the other models, achieving average BERT (Bidirectional Encoder
Representations from Transformers) Score values in precision, recall, and
F1-scores of 68.5\%, 69.08\%, and 68.5\%, respectively. Comparative
benchmarking and qualitative assessments validate the system's ability to
produce coherent and relevant medical replies to informal input. This study
highlights the potential of generative artificial intelligence (AI) in
advancing HMS, offering a scalable and adaptable solution for global healthcare
challenges, especially in linguistically and culturally diverse environments.
comment: Published in 2025 4th International Conference on Computer
Technologies (ICCTech)
☆ Querying Climate Knowledge: Semantic Retrieval for Scientific Discovery SIGIR 2025
The growing complexity and volume of climate science literature make it
increasingly difficult for researchers to find relevant information across
models, datasets, regions, and variables. This paper introduces a
domain-specific Knowledge Graph (KG) built from climate publications and
broader scientific texts, aimed at improving how climate knowledge is accessed
and used. Unlike keyword-based search, our KG supports structured, semantic
queries that help researchers discover precise connections such as which models
have been validated in specific regions or which datasets are commonly used
with certain teleconnection patterns. We demonstrate how the KG answers such
questions using Cypher queries, and outline its integration with large language
models in RAG systems to improve transparency and reliability in
climate-related question answering. This work moves beyond KG construction to
show its real world value for climate researchers, model developers, and others
who rely on accurate, contextual scientific information.
comment: ACM SIGIR 2025 Workshop MANILA
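A hedged example of the kind of structured query the abstract describes ("which models have been validated in a specific region"), issued through the official neo4j Python driver. The node labels, relationship types, property names, and connection details are hypothetical and only illustrate the pattern.
```python
from neo4j import GraphDatabase

# hypothetical schema: (:ClimateModel)-[:VALIDATED_IN]->(:Region {name})
query = """
MATCH (m:ClimateModel)-[:VALIDATED_IN]->(r:Region {name: $region})
RETURN m.name AS model, r.name AS region
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for record in session.run(query, region="Sahel"):
        print(record["model"], "validated in", record["region"])
driver.close()
```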
☆ Established Psychometric vs. Ecologically Valid Questionnaires: Rethinking Psychological Assessments in Large Language Models
Researchers have applied established psychometric questionnaires (e.g., BFI,
PVQ) to measure the personality traits and values reflected in the responses of
Large Language Models (LLMs). However, concerns have been raised about applying
these human-designed questionnaires to LLMs. One such concern is their lack of
ecological validity--the extent to which survey questions adequately reflect
and resemble real-world contexts in which LLMs generate texts in response to
user queries. However, it remains unclear how established questionnaires and
ecologically valid questionnaires differ in their outcomes, and what insights
these differences may provide. In this paper, we conduct a comprehensive
comparative analysis of the two types of questionnaires. Our analysis reveals
that established questionnaires (1) yield substantially different profiles of
LLMs from ecologically valid ones, deviating from the psychological
characteristics expressed in the context of user queries, (2) suffer from
insufficient items for stable measurement, (3) create misleading impressions
that LLMs possess stable constructs, and (4) yield exaggerated profiles for
persona-prompted LLMs. Overall, our work cautions against the use of
established psychological questionnaires for LLMs. Our code will be released
upon publication.
comment: 17 pages, 4 figures
☆ !MSA at BAREC Shared Task 2025: Ensembling Arabic Transformers for Readability Assessment EMNLP 2025
We present !MSA's winning system for the BAREC 2025 Shared Task on fine-grained
Arabic readability assessment, achieving first place in six of six tracks. Our
approach is a confidence-weighted ensemble of four complementary transformer
models (AraBERTv2, AraELECTRA, MARBERT, and CAMeLBERT), each fine-tuned with
distinct loss functions to capture diverse readability signals. To tackle
severe class imbalance and data scarcity, we applied weighted training,
advanced preprocessing, SAMER corpus relabeling with our strongest model, and
synthetic data generation via Gemini 2.5 Flash, adding about 10,000 rare-level
samples. A targeted post-processing step corrected prediction distribution
skew, delivering a 6.3 percent Quadratic Weighted Kappa (QWK) gain. Our system
reached 87.5 percent QWK at the sentence level and 87.4 percent at the document
level, demonstrating the power of model and loss diversity, confidence-informed
fusion, and intelligent augmentation for robust Arabic readability prediction.
comment: 10 pages, 8 figures, ArabicNLP 2025, co-located with EMNLP 2025
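To make the fusion step concrete, here is a minimal sketch of a confidence-weighted ensemble over per-model class probabilities, in the spirit of the system described above. The weighting rule (confidence taken as each model's maximum probability) and the toy numbers are assumptions.
```python
import numpy as np

# probs[m, c]: probability of readability level c from model m for one sentence
probs = np.array([
    [0.10, 0.60, 0.30],   # e.g. AraBERTv2
    [0.20, 0.50, 0.30],   # e.g. AraELECTRA
    [0.05, 0.25, 0.70],   # e.g. MARBERT
    [0.15, 0.55, 0.30],   # e.g. CAMeLBERT
])
confidence = probs.max(axis=1)            # per-model confidence (assumed definition)
weights = confidence / confidence.sum()
ensembled = weights @ probs               # confidence-weighted average of distributions
print(ensembled, "-> predicted level", int(ensembled.argmax()))
```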
☆ Linguistic trajectories of bipolar disorder on social media
Language provides valuable markers of affective disorders such as bipolar
disorder (BD), yet clinical assessments remain limited in scale. In response,
analyses of social media (SM) language have gained prominence due to their high
temporal resolution and longitudinal scope. Here, we introduce a method to
determine the timing of users' diagnoses and apply it to study language
trajectories from 3 years before to 21 years after BD diagnosis - contrasted
with users reporting unipolar depression (UD) and non-affected users (HC). We
show that BD diagnosis is accompanied by pervasive linguistic alterations
reflecting mood disturbance, psychiatric comorbidity, substance abuse,
hospitalization, medical comorbidities, unusual thought content, and
disorganized thought. We further observe recurring mood-related language
changes across two decades after the diagnosis, with a pronounced 12-month
periodicity suggestive of seasonal mood episodes. Finally, trend-level evidence
suggests an increased periodicity in users estimated to be female. In sum, our
findings provide evidence for language alterations in the acute and chronic
phase of BD. This validates and extends recent efforts leveraging SM for
scalable monitoring of mental health.
comment: Pre-print
☆ Unified Learnable 2D Convolutional Feature Extraction for ASR
Neural front-ends represent a promising approach to feature extraction for
automatic speech recognition (ASR) systems, as they make it possible to learn
features specifically tailored to different tasks. Yet, many of the existing techniques
remain heavily influenced by classical methods. While this inductive bias may
ease the system design, our work aims to develop a more generic front-end for
feature extraction. Furthermore, we seek to unify the front-end architecture
contrasting with existing approaches that apply a composition of several layer
topologies originating from different sources. The experiments systematically
show how to reduce the influence of existing techniques to achieve a generic
front-end. The resulting 2D convolutional front-end is parameter-efficient and
suitable for a scenario with limited computational resources unlike large
models pre-trained on unlabeled audio. The results demonstrate that this
generic unified approach is not only feasible but also matches the performance
of existing supervised learnable feature extractors.
comment: Accepted at ITG Conference on Speech Communication 2025
☆ Multi-Intent Recognition in Dialogue Understanding: A Comparison Between Smaller Open-Source LLMs
In this paper, we provide an extensive analysis of multi-label intent
classification using Large Language Models (LLMs) that are open-source,
publicly available, and can be run in consumer hardware. We use the MultiWOZ
2.1 dataset, a benchmark in the dialogue system domain, to investigate the
efficacy of three popular open-source pre-trained LLMs, namely Llama2-7B-hf,
Mistral-7B-v0.1, and Yi-6B. We perform the classification task in a few-shot
setup, giving 20 examples in the prompt with some instructions. Our approach
focuses on the differences in performance of these models across several
performance metrics by methodically assessing these models on multi-label
intent classification tasks. Additionally, we compare the performance of the
instruction-based fine-tuning approach with supervised learning using the
smaller transformer model BertForSequenceClassification as a baseline. To
evaluate the performance of the models, we use evaluation metrics like
accuracy, precision, and recall as well as micro, macro, and weighted F1 score.
We also report the inference time, VRAM requirements, etc. The Mistral-7B-v0.1
outperforms two other generative models on 11 intent classes out of 14 in terms
of F-Score, with a weighted average of 0.50. It also has relatively lower
Hamming Loss and higher Jaccard Similarity, making it the winning model in the
few-shot setting. We find that the BERT-based supervised classifier outperforms
the best-performing few-shot generative LLM. The study
provides a framework for small open-source LLMs in detecting complex
multi-intent dialogues, enhancing the Natural Language Understanding aspect of
task-oriented chatbots.
☆ Unsupervised Hallucination Detection by Inspecting Reasoning Processes EMNLP 2025
Unsupervised hallucination detection aims to identify hallucinated content
generated by large language models (LLMs) without relying on labeled data.
While unsupervised methods have gained popularity by eliminating
labor-intensive human annotations, they frequently rely on proxy signals
unrelated to factual correctness. This misalignment biases detection probes
toward superficial or non-truth-related aspects, limiting generalizability
across datasets and scenarios. To overcome these limitations, we propose IRIS,
an unsupervised hallucination detection framework, leveraging internal
representations intrinsic to factual correctness. IRIS prompts the LLM to
carefully verify the truthfulness of a given statement and obtains its
contextualized embedding as an informative feature for training. Meanwhile, the
uncertainty of each response is considered a soft pseudolabel for truthfulness.
Experimental results demonstrate that IRIS consistently outperforms existing
unsupervised methods. Our approach is fully unsupervised, computationally
low-cost, and works well even with little training data, making it suitable for
real-time detection.
comment: To appear in EMNLP 2025
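A hedged sketch of the recipe summarized above: prompt the LLM to verify a statement, take a contextualized embedding of that verification pass as the feature, and use the model's own uncertainty as a soft pseudolabel for a small probe. The model choice (gpt2 as a stand-in), mean pooling, the entropy-based uncertainty proxy, and the median split into hard labels are all assumptions.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

name = "gpt2"   # stand-in; any causal LM exposing hidden states works
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True).eval()

def embed_and_uncertainty(statement: str):
    prompt = ("Carefully verify whether the following statement is true.\n"
              f"Statement: {statement}\nAnswer:")
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    feat = out.hidden_states[-1][0].mean(dim=0)                   # mean-pooled last layer
    probs = torch.softmax(out.logits[0, -1], dim=-1)              # next-token distribution
    uncertainty = -(probs * probs.clamp_min(1e-12).log()).sum()   # entropy as proxy
    return feat.numpy(), float(uncertainty)

statements = ["Paris is the capital of France.", "The Moon is larger than the Earth."]
X, soft = zip(*(embed_and_uncertainty(s) for s in statements))
# turn soft pseudolabels into hard labels for a tiny probe (low uncertainty ~ truthful)
y = [int(u < sorted(soft)[len(soft) // 2]) for u in soft]
probe = LogisticRegression(max_iter=1000).fit(X, y)
print(probe.predict(X))
```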
☆ CMHG: A Dataset and Benchmark for Headline Generation of Minority Languages in China
Minority languages in China, such as Tibetan, Uyghur, and Traditional
Mongolian, face significant challenges due to their unique writing systems,
which differ from international standards. This discrepancy has led to a severe
lack of relevant corpora, particularly for supervised tasks like headline
generation. To address this gap, we introduce a novel dataset, Chinese Minority
Headline Generation (CMHG), which includes 100,000 entries for Tibetan, and
50,000 entries each for Uyghur and Mongolian, specifically curated for headline
generation tasks. Additionally, we propose a high-quality test set annotated by
native speakers, designed to serve as a benchmark for future research in this
domain. We hope this dataset will become a valuable resource for advancing
headline generation in Chinese minority languages and contribute to the
development of related benchmarks.
☆ Whisper Has an Internal Word Aligner
There is an increasing interest in obtaining accurate word-level timestamps
from strong automatic speech recognizers, in particular Whisper. Existing
approaches either require additional training or are simply not competitive.
The evaluation in prior work is also relatively loose, typically using a
tolerance of more than 200 ms. In this work, we discover attention heads in
Whisper that capture accurate word alignments and are distinctively different
from those that do not. Moreover, we find that using characters produces finer
and more accurate alignments than using wordpieces. Based on these findings, we
propose an unsupervised approach to extracting word alignments by filtering
attention heads while teacher forcing Whisper with characters. Our approach not
only does not require training but also produces word alignments that are more
accurate than prior work under a stricter tolerance between 20 ms and 100 ms.
comment: ASRU 2025
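A hedged, simplified sketch of the idea above: teacher-force Whisper on a reference transcript, read cross-attention from a chosen subset of decoder heads, and take each token's attention peak as its time. The head indices, the use of wordpiece (not character) tokens, the silent placeholder audio, and the 20 ms-per-encoder-frame conversion are assumptions; the paper's head-filtering criterion is not reproduced here.
```python
import numpy as np
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny").eval()

audio = np.zeros(16000, dtype=np.float32)               # stand-in for a 1 s utterance
features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
text_ids = processor.tokenizer(" hello world", return_tensors="pt").input_ids

with torch.no_grad():
    out = model(input_features=features, decoder_input_ids=text_ids,
                output_attentions=True)

# cross_attentions: one tensor per decoder layer, shape (batch, heads, tgt_len, src_len)
chosen = [(2, 0), (3, 1)]                                # hypothetical "alignment" heads
att = torch.stack([out.cross_attentions[l][0, h] for l, h in chosen]).mean(dim=0)

for pos, tok_id in enumerate(text_ids[0]):
    frame = int(att[pos].argmax())                       # attention peak over encoder frames
    print(processor.tokenizer.decode([tok_id]), f"~ {frame * 0.02:.2f}s")
```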
☆ Large Language Models Meet Legal Artificial Intelligence: A Survey
Large Language Models (LLMs) have significantly advanced the development of
Legal Artificial Intelligence (Legal AI) in recent years, enhancing the
efficiency and accuracy of legal tasks. To advance research and applications of
LLM-based approaches in the legal domain, this paper provides a comprehensive
review of 16 series of legal LLMs and 47 LLM-based frameworks for legal tasks, and
also gathers 15 benchmarks and 29 datasets to evaluate different legal
capabilities. Additionally, we analyse the challenges and discuss future
directions for LLM-based approaches in the legal domain. We hope this paper
provides a systematic introduction for beginners and encourages future research
in this field. Resources are available at
https://github.com/ZhitianHou/LLMs4LegalAI.
♻ ☆ Router-Tuning: A Simple and Effective Approach for Enabling Dynamic-Depth in Transformers EMNLP 2025
Traditional transformer models often allocate a fixed amount of computational
resources to every input token, leading to inefficient and unnecessary
computation. To address this, the Mixture of Depths (MoD) was introduced to
dynamically adjust the computational depth by skipping less important layers.
Despite its promise, current MoD approaches remain under-explored and face two
main challenges: (1) high training costs due to the need to train the entire
model along with the routers that determine which layers to skip, and (2) the
risk of performance degradation when important layers are bypassed. In response
to the first issue, we propose Router-Tuning, a method that fine-tunes only the
router on a small dataset, drastically reducing the computational overhead
associated with full model training. For the second challenge, we propose
MindSkip, which deploys Attention with Dynamic Depths. This method preserves
the model's performance while significantly enhancing computational and memory
efficiency. Extensive experiments demonstrate that our approach delivers
competitive results while dramatically improving the computation efficiency,
e.g., 21\% speedup and only a 0.2\% performance drop. The code is released at
https://github.com/CASE-Lab-UMD/Router-Tuning.
comment: EMNLP 2025 Main Conference
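A minimal sketch of the training setup described above: freeze every parameter except the routing modules and fine-tune only those on a small dataset. The substring used to locate router weights ("router") and the toy model are assumptions; real MoD/MoE implementations name these modules differently.
```python
import torch

def mark_router_only_trainable(model: torch.nn.Module, keyword: str = "router"):
    """Freeze all parameters whose name does not contain the router keyword."""
    trainable = 0
    for name, p in model.named_parameters():
        p.requires_grad = keyword in name
        trainable += p.numel() if p.requires_grad else 0
    return trainable

# toy model with a "router" head standing in for an MoD layer-skip router
model = torch.nn.ModuleDict({
    "backbone": torch.nn.Linear(64, 64),
    "router": torch.nn.Linear(64, 2),
})
n = mark_router_only_trainable(model)
opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-4)
print(f"trainable parameters: {n}")
```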
♻ ☆ Direct Judgement Preference Optimization EMNLP 2025
Auto-evaluation is crucial for assessing response quality and offering
feedback for model development. Recent studies have explored training large
language models (LLMs) as generative judges to evaluate and critique other
models' outputs. In this work, we investigate the idea of learning from both
positive and negative data with preference optimization to enhance the
evaluation capabilities of LLM judges across an array of different use cases.
We achieve this by employing three approaches to collect the preference pairs
for different use cases, each aimed at improving our generative judge from a
different perspective. Our comprehensive study over a wide range of benchmarks
demonstrates the effectiveness of our method. In particular, our generative
judge achieves the best performance on 10 out of 13 benchmarks, outperforming
strong baselines like GPT-4o and specialized judge models. Further analysis
shows that our judge model robustly counters inherent biases such as position
and length bias, flexibly adapts to any evaluation protocol specified by
practitioners, and provides helpful language feedback for improving downstream
generator models.
comment: EMNLP 2025
♻ ☆ Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance
NLP benchmarks rely on standardized datasets for training and evaluating
models and are crucial for advancing the field. Traditionally, expert
annotations ensure high-quality labels; however, the cost of expert annotation
does not scale well with the growing demand for larger datasets required by
modern models. While crowd-sourcing provides a more scalable solution, it often
comes at the expense of annotation precision and consistency. Recent
advancements in large language models (LLMs) offer new opportunities to enhance
the annotation process, particularly for detecting label errors in existing
datasets. In this work, we consider the recent approach of LLM-as-a-judge,
leveraging an ensemble of LLMs to flag potentially mislabeled examples. We
conduct a case study on four factual consistency datasets from the TRUE
benchmark, spanning diverse NLP tasks, and on SummEval, which uses Likert-scale
ratings of summary quality across multiple dimensions. We empirically analyze
the labeling quality of existing datasets and compare expert, crowd-sourced,
and LLM-based annotations in terms of the agreement, label quality, and
efficiency, demonstrating the strengths and limitations of each annotation
method. Our findings reveal a substantial number of label errors, which, when
corrected, induce a significant upward shift in reported model performance.
This suggests that many of the LLMs' so-called mistakes are due to label errors
rather than genuine model failures. Additionally, we discuss the implications
of mislabeled data and propose methods to mitigate them in training to improve
performance.
♻ ☆ Parallel-R1: Towards Parallel Thinking via Reinforcement Learning
Tong Zheng, Hongming Zhang, Wenhao Yu, Xiaoyang Wang, Runpeng Dai, Rui Liu, Huiwen Bao, Chengsong Huang, Heng Huang, Dong Yu
Parallel thinking has emerged as a novel approach for enhancing the reasoning
capabilities of large language models (LLMs) by exploring multiple reasoning
paths concurrently. However, activating such capabilities through training
remains challenging, as existing methods predominantly rely on supervised
fine-tuning (SFT) over synthetic data, which encourages teacher-forced
imitation rather than exploration and generalization. Different from them, we
propose \textbf{Parallel-R1}, the first reinforcement learning (RL) framework
that enables parallel thinking behaviors for complex real-world reasoning
tasks. Our framework employs a progressive curriculum that explicitly addresses
the cold-start problem in training parallel thinking with RL. We first use SFT
on prompt-generated trajectories from easier tasks to instill the parallel
thinking ability, then transition to RL to explore and generalize this skill on
harder problems. Experiments on various math benchmarks, including MATH, AMC23,
and AIME, show that Parallel-R1 successfully instills parallel thinking,
leading to 8.4% accuracy improvements over the sequential thinking model
trained directly on challenging tasks with RL. Further analysis reveals a clear
shift in the model's thinking behavior: at an early stage, it uses parallel
thinking as an exploration strategy, while in a later stage, it uses the same
capability for multi-perspective verification. Most significantly, we validate
parallel thinking as a \textbf{mid-training exploration scaffold}, where this
temporary exploratory phase unlocks a higher performance ceiling after RL,
yielding a 42.9% improvement over the baseline on AIME25. Our model, data, and
code will be open-source at https://github.com/zhengkid/Parallel-R1.
comment: Project website: https://zhengkid.github.io/Parallel_R1.github.io/
♻ ☆ Slaves to the Law of Large Numbers: An Asymptotic Equipartition Property for Perplexity in Generative Language Models
We prove a new asymptotic un-equipartition property for the perplexity of
long texts generated by a language model and present supporting experimental
evidence from open-source models. Specifically we show that the logarithmic
perplexity of any large text generated by a language model must asymptotically
converge to the average entropy of its token distributions. This defines a
``typical set'' that all long synthetic texts generated by a language model
must belong to. We refine the concept of ``typical set'' to include only
grammatically correct texts. We then show that this refined typical set is a
vanishingly small subset of all possible grammatically correct texts for a very
general definition of grammar. This means that language models are strongly
constrained in the range of their possible behaviors and outputs. We make no
simplifying assumptions (such as stationarity) about the statistics of language
model outputs, and therefore our results are directly applicable to practical
real-world models without any approximations. We discuss possible applications
of the typical set concept to problems such as detecting synthetic texts and
membership inference in training datasets.
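One way to write the claimed convergence, as our rendering of the abstract's statement rather than the paper's exact notation: the per-token log perplexity of a long generated sequence approaches the average entropy of the model's conditional token distributions.
```latex
\[
  -\frac{1}{n}\sum_{i=1}^{n}\log p\!\left(x_i \mid x_{<i}\right)
  \;-\;
  \frac{1}{n}\sum_{i=1}^{n} H\!\bigl(p(\,\cdot \mid x_{<i})\bigr)
  \;\xrightarrow[n\to\infty]{}\; 0,
  \qquad
  H\!\bigl(p(\,\cdot \mid x_{<i})\bigr)
  = -\sum_{v} p(v \mid x_{<i})\,\log p(v \mid x_{<i}).
\]
```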
♻ ☆ UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs
Managing long texts is challenging for large language models (LLMs) due to
limited context window sizes. This study introduces UIO-LLMs, an unbiased
incremental optimization approach for memory-enhanced transformers under
long-context settings. We initially conceptualize the process as a streamlined
encoder-decoder framework where the weights-shared encoder and decoder
respectively encapsulate a context segment into memories and leverage these
memories to predict outputs of the subsequent segment. Subsequently, by
treating our memory-enhanced transformers as fully-connected recurrent neural
networks (RNNs), we refine the training process using the Truncated
Backpropagation Through Time (TBPTT) algorithm, which incorporates innovative
incremental optimization techniques. These techniques not only diminish time
complexity but also address the bias in gradient computation through an
unbiased optimization process. UIO-LLMs successfully handle long context, such
as extending the context window of Llama2-7b-chat from 4K to 100K tokens with
only 2% additional parameters, while keeping the inference cost nearly
linear as context length increases.
comment: This article was not accepted, and its quality is not very good.
Therefore, we have decided to withdraw the submission and will not resubmit
it elsewhere
♻ ☆ Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes EMNLP 2025
Humour, as a complex language form, is derived from myriad aspects of life.
Whilst existing work on computational humour has focussed almost exclusively on
short pun-based jokes, we investigate whether the ability of Large Language
Models (LLMs) to explain humour depends on the particular form. We compare
models' joke explanation abilities from simple puns to complex topical humour
that requires esoteric knowledge of real-world entities and events. To this
end, we curate a dataset of 600 jokes across 4 joke types and manually write
high-quality explanations. These jokes include heterographic and homographic
puns, contemporary internet humour, and topical jokes. Using this dataset, we
compare the zero-shot abilities of a range of LLMs to accurately and
comprehensively explain jokes of different types, identifying key research gaps
in the task of humour explanation. We find that none of the tested models
(including reasoning models) are capable of reliably generating adequate
explanations of all joke types, further highlighting the narrow focus of most
existing works on overly simple joke forms.
comment: Accepted to Findings of EMNLP 2025
♻ ☆ MachineLearningLM: Scaling Many-shot In-context Learning via Continued Pretraining
Large language models (LLMs) possess broad world knowledge and strong
general-purpose reasoning ability, yet they struggle to learn from many
in-context examples on standard machine learning (ML) tasks, that is, to
leverage many-shot demonstrations purely via in-context learning (ICL) without
gradient descent. We introduce MachineLearningLM, a portable
continued-pretraining framework that equips a general-purpose LLM with robust
in-context ML capability while preserving its general knowledge and reasoning
for broader chat workflows.
Our pretraining procedure synthesizes ML tasks from millions of structural
causal models (SCMs), spanning shot counts up to 1,024. We begin with a
random-forest teacher, distilling tree-based decision strategies into the LLM
to strengthen robustness in numerical modeling. All tasks are serialized with a
token-efficient prompt, enabling 3x to 6x more examples per context window and
delivering up to 50x amortized throughput via batch inference.
Despite a modest setup (Qwen-2.5-7B-Instruct with LoRA rank 8),
MachineLearningLM outperforms strong LLM baselines (e.g., GPT-5-mini) by an
average of about 15% on out-of-distribution tabular classification across
finance, physics, biology, and healthcare domains. It exhibits a striking
many-shot scaling law: accuracy increases monotonically as in-context
demonstrations grow from 8 to 1,024. Without any task-specific training, it
attains random-forest-level accuracy across hundreds of shots. General chat
capabilities, including knowledge and reasoning, are preserved: it achieves
75.4% on MMLU.
♻ ☆ A 2-step Framework for Automated Literary Translation Evaluation: Its Promises and Pitfalls
In this work, we propose and evaluate the feasibility of a two-stage pipeline
to evaluate literary machine translation, in a fine-grained manner, from
English to Korean. The results show that our framework provides fine-grained,
interpretable metrics suited for literary translation and obtains a higher
correlation with human judgment than traditional machine translation metrics.
Nonetheless, it still fails to match inter-human agreement, especially in
metrics like Korean Honorifics. We also observe that LLMs tend to favor
translations generated by other LLMs, and we highlight the necessity of
developing more sophisticated evaluation methods to ensure accurate and
culturally sensitive machine translation of literary works.
♻ ☆ Alignment-Augmented Speculative Decoding with Alignment Sampling and Conditional Verification EMNLP 2025
Recent works have revealed the great potential of speculative decoding in
accelerating the autoregressive generation process of large language models.
The success of these methods relies on the alignment between draft candidates
and the sampled outputs of the target model. Existing methods mainly achieve
draft-target alignment with training-based methods, e.g., EAGLE, Medusa,
involving considerable training costs. In this paper, we present a
training-free alignment-augmented speculative decoding algorithm. We propose
alignment sampling, which leverages output distribution obtained in the
prefilling phase to provide more aligned draft candidates. To further benefit
from high-quality but non-aligned draft candidates, we also introduce a simple
yet effective flexible verification strategy. Through an adaptive probability
threshold, our approach can improve generation accuracy while further improving
inference efficiency. Experiments on 8 datasets (including question answering,
summarization and code completion tasks) show that our approach increases the
average generation score by 3.3 points for the LLaMA3 model. Our method
achieves a mean acceptance length of up to 2.39 and speeds up generation by a factor of 2.23.
comment: Accepted at EMNLP 2025 Main
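A hedged toy of the flexible verification step described above: accept a draft token whenever the target model's probability for it clears an adaptive threshold, otherwise fall back to the target model's own choice and stop. The threshold rule (a fraction of the target distribution's peak) and the random stand-in distributions are assumptions; the paper's exact criterion may differ.
```python
import numpy as np

rng = np.random.default_rng(0)
vocab = 50
draft_tokens = rng.integers(0, vocab, size=8)           # tokens proposed by the draft step
target_probs = rng.dirichlet(np.ones(vocab), size=8)    # target-model distributions per step

def flexible_verify(draft_tokens, target_probs, alpha=0.3):
    accepted = []
    for t, p in zip(draft_tokens, target_probs):
        threshold = alpha * p.max()                     # adaptive: tracks how peaked the target is
        if p[t] >= threshold:
            accepted.append(int(t))                     # draft token is aligned enough: keep it
        else:
            accepted.append(int(p.argmax()))            # replace with the target's choice and stop
            break
    return accepted

print(flexible_verify(draft_tokens, target_probs))
```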
♻ ☆ Reframe Your Life Story: Interactive Narrative Therapist and Innovative Moment Assessment with Large Language Models EMNLP 2025
Yi Feng, Jiaqi Wang, Wenxuan Zhang, Zhuang Chen, Yutong Shen, Xiyao Xiao, Minlie Huang, Liping Jing, Jian Yu
Recent progress in large language models (LLMs) has opened new possibilities
for mental health support, yet current approaches lack realism in simulating
specialized psychotherapy and fail to capture therapeutic progression over
time. Narrative therapy, which helps individuals transform problematic life
stories into empowering alternatives, remains underutilized due to limited
access and social stigma. We address these limitations through a comprehensive
framework with two core components. First, INT (Interactive Narrative
Therapist) simulates expert narrative therapists by planning therapeutic
stages, guiding reflection levels, and generating contextually appropriate
expert-like responses. Second, IMA (Innovative Moment Assessment) provides a
therapy-centric evaluation method that quantifies effectiveness by tracking
"Innovative Moments" (IMs), critical narrative shifts in client speech
signaling therapy progress. Experimental results on 260 simulated clients and
230 human participants reveal that INT consistently outperforms standard LLMs
in therapeutic quality and depth. We further demonstrate the effectiveness of
INT in synthesizing high-quality support conversations to facilitate social
applications.
comment: EMNLP 2025 Main
♻ ☆ OmniEVA: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-aware Reasoning
Yuecheng Liu, Dafeng Chi, Shiguang Wu, Zhanguang Zhang, Yuzheng Zhuang, Bowen Yang, He Zhu, Lingfeng Zhang, Pengwei Xie, David Gamaliel Arcos Bravo, Yingxue Zhang, Jianye Hao, Xingyue Quan
Recent advances in multimodal large language models (MLLMs) have opened new
opportunities for embodied intelligence, enabling multimodal understanding,
reasoning, and interaction, as well as continuous spatial decision-making.
Nevertheless, current MLLM-based embodied systems face two critical
limitations. First, Geometric Adaptability Gap: models trained solely on 2D
inputs or with hard-coded 3D geometry injection suffer from either insufficient
spatial information or restricted 2D generalization, leading to poor
adaptability across tasks with diverse spatial demands. Second, Embodiment
Constraint Gap: prior work often neglects the physical constraints and
capacities of real robots, resulting in task plans that are theoretically valid
but practically infeasible. To address these gaps, we introduce OmniEVA -- an
embodied versatile planner that enables advanced embodied reasoning and task
planning through two pivotal innovations: (1) a Task-Adaptive 3D Grounding
mechanism, which introduces a gated router to perform explicit selective
regulation of 3D fusion based on contextual requirements, enabling
context-aware 3D grounding for diverse embodied tasks. (2) an Embodiment-Aware
Reasoning framework that jointly incorporates task goals and embodiment
constraints into the reasoning loop, resulting in planning decisions that are
both goal-directed and executable. Extensive experimental results demonstrate
that OmniEVA not only achieves state-of-the-art general embodied reasoning
performance, but also exhibits a strong ability across a wide range of
downstream scenarios. Evaluations of a suite of proposed embodied benchmarks,
including both primitive and composite tasks, confirm its robust and versatile
planning capabilities. Project page: https://omnieva.github.io
♻ ☆ Can Large Language Models Master Complex Card Games?
Complex games have long been an important benchmark for testing the progress
of artificial intelligence algorithms. AlphaGo, AlphaZero, and MuZero have
defeated top human players in Go and Chess, garnering widespread societal
attention towards artificial intelligence. Concurrently, large language models
(LLMs) have exhibited remarkable capabilities across various tasks, raising the
question of whether LLMs can achieve similar success in complex games. In this
paper, we explore the potential of LLMs in mastering complex card games. We
systematically assess the learning capabilities of LLMs across eight diverse
card games, evaluating the impact of fine-tuning on high-quality gameplay data,
and examining the models' ability to retain general capabilities while
mastering these games. Our findings indicate that: (1) LLMs can approach the
performance of strong game AIs through supervised fine-tuning on high-quality
data, (2) LLMs can master multiple complex card games simultaneously, with
performance augmentation for games with similar rules and conflicts for
dissimilar ones, and (3) LLMs experience a decline in general capabilities when
mastering complex games, but this decline can be mitigated by integrating a
certain amount of general instruction data. The evaluation results demonstrate
the strong learning ability and versatility of LLMs.
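Finding (3) above suggests mixing general instruction data into the game-specific fine-tuning set to limit the loss of general capabilities. A minimal sketch of such a data-mixing step follows; the 20% ratio and field names are illustrative assumptions, not the paper's recipe.

```python
# Illustrative data mixing for SFT: keep all game trajectories and add a
# fixed fraction of general instruction examples. Ratio is an assumption.
import random

def mix_datasets(game_data, general_data, general_ratio=0.2, seed=0):
    rng = random.Random(seed)
    # Number of general examples so they make up `general_ratio` of the mix.
    n_general = int(len(game_data) * general_ratio / (1 - general_ratio))
    mixed = list(game_data) + rng.sample(list(general_data), n_general)
    rng.shuffle(mixed)
    return mixed

game = [{"prompt": f"game state {i}", "target": "play the 7 of spades"}
        for i in range(800)]
general = [{"prompt": f"instruction {i}", "target": "..."} for i in range(1000)]
print(len(mix_datasets(game, general)))  # 800 game + 200 general = 1000
```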
♻ ☆ Input-Time Scaling
Current Large Language Models (LLMs) are usually post-trained on large-scale,
carefully curated datasets (data and training scaling) and perform reasoning at
test time (inference-time scaling). In this work, we present a new scaling
paradigm, Input-Time Scaling, which complements previous scaling methods by
devoting resources to the queries themselves (input time). During training and
testing, we use meta-knowledge from LLMs to refine inputs with different
strategies. We also identify a new phenomenon, train-test co-design: query
strategies must be applied during training and testing together, and applying
them only at training or only at testing seriously degrades the gains.
Surprisingly, datasets of seemingly low quality can perform better; we obtain
the best performance even by adding irrelevant information to the queries,
using 1k examples randomly selected from a minimally filtered dataset. These
findings contradict the widely held inductive bias of "garbage in, garbage
out", and curating seemingly high-quality data can even limit the performance
ceiling. In addition, models trained on more data of similar quality (15k vs.
1k) perform worse, so the intuition of simply scaling dataset size should also
be examined carefully. The good news is that our findings are compatible with
the Less-is-More phenomenon: 1k examples are enough to invoke high-level
reasoning ability. In experiments on Qwen2.5-32B-Instruct, we reach
state-of-the-art performance among 32B models on AIME24 (76.7%) and AIME25
(76.7%) pass@1, and further achieve AIME24 (76.7%) and AIME25 (80%) with a
majority vote over three models. Starting from DeepSeek-R1-Distill-Qwen-32B,
the results reach 90.0% on AIME24 and 80.0% on AIME25. To facilitate
reproducibility and further research, we are working on open-sourcing our
datasets, data pipelines, evaluation results, and checkpoints.
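To make the train-test co-design concrete, here is a minimal, hypothetical sketch of wrapping queries with the same input-time strategy during fine-tuning and at inference; the strategy texts and helper names are invented for illustration and are not the paper's prompts.

```python
# Illustrative "input-time" query refinement applied identically at training
# and test time (train-test co-design). Strategy texts are assumptions.
def refine_query(query: str, strategy: str = "elaborate") -> str:
    prefixes = {
        "elaborate": "Restate the problem in your own words, then solve it step by step.\n",
        "distract": "Note: some of the context below may be irrelevant.\n",
    }
    return prefixes[strategy] + query

def build_training_example(question: str, solution: str, strategy: str) -> dict:
    # The same strategy must wrap the query during fine-tuning ...
    return {"input": refine_query(question, strategy), "target": solution}

def build_test_prompt(question: str, strategy: str) -> str:
    # ... and during inference; applying it on only one side degrades gains.
    return refine_query(question, strategy)

print(build_test_prompt("Find the last two digits of 7^2024.", "elaborate"))
```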
♻ ☆ Polish-English medical knowledge transfer: A new benchmark and results
Large Language Models (LLMs) have demonstrated significant potential in
handling specialized tasks, including medical problem-solving. However, most
studies predominantly focus on English-language contexts. This study introduces
a novel benchmark dataset based on Polish medical licensing and specialization
exams (LEK, LDEK, PES) taken by medical doctor candidates and practicing
doctors pursuing specialization. The dataset was web-scraped from publicly
available resources provided by the Medical Examination Center and the Chief
Medical Chamber. It comprises over 24,000 exam questions, including a subset of
parallel Polish-English corpora, where the English portion was professionally
translated by the examination center for foreign candidates. By creating a
structured benchmark from these existing exam questions, we systematically
evaluate state-of-the-art LLMs, including general-purpose, domain-specific, and
Polish-specific models, and compare their performance against human medical
students. Our analysis reveals that while models like GPT-4o achieve near-human
performance, significant challenges persist in cross-lingual translation and
domain-specific understanding. These findings underscore disparities in model
performance across languages and medical specialties, highlighting the
limitations and ethical considerations of deploying LLMs in clinical practice.
♻ ☆ Humans Hallucinate Too: Language Models Identify and Correct Subjective Annotation Errors With Label-in-a-Haystack Prompts EMNLP
Modeling complex subjective tasks in Natural Language Processing, such as
recognizing emotion and morality, is considerably challenging due to
significant variation in human annotations. This variation often reflects
reasonable differences in semantic interpretations rather than mere noise,
necessitating methods to distinguish between legitimate subjectivity and error.
We address this challenge by exploring label verification in these contexts
using Large Language Models (LLMs). First, we propose a simple In-Context
Learning binary filtering baseline that estimates the reasonableness of a
document-label pair. We then introduce the Label-in-a-Haystack setting: the
query and its label(s) are included in the demonstrations shown to LLMs, which
are prompted to predict the label(s) again, while receiving task-specific
instructions (e.g., emotion recognition) rather than label copying. We show
that the failure to copy the label(s) to the LLM's output is task-relevant and
informative. Building on this, we propose the Label-in-a-Haystack Rectification
(LiaHR) framework for subjective label correction: when the model outputs
diverge from the reference gold labels, we assign the generated labels to the
example instead of discarding it. This approach can be integrated into
annotation pipelines to enhance signal-to-noise ratios. Comprehensive analyses,
human evaluations, and ecological validity studies verify the utility of LiaHR
for label correction. Code is available at https://github.com/gchochla/liahr.
comment: Accepted to the Main Proceedings of EMNLP, 2025. 20 pages, 16
figures, 10 tables
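A schematic sketch of the Label-in-a-Haystack setup follows: the query and its candidate labels appear inside the demonstrations, the LLM re-predicts the labels, and, following the LiaHR rule, labels the model fails to reproduce are replaced by its outputs rather than discarded. The prompt format and the `call_llm` hook are assumptions, not the released implementation.

```python
# Schematic Label-in-a-Haystack prompting and rectification. `call_llm` is a
# placeholder for any LLM completion function.
def build_prompt(demos, query_text, query_labels):
    lines = ["Task: label each text with its emotions."]
    # The query and its candidate labels are shown among the demonstrations.
    for text, labels in demos + [(query_text, query_labels)]:
        lines.append(f"Text: {text}\nLabels: {', '.join(labels)}")
    # The model is then asked to predict the query's labels again.
    lines.append(f"Text: {query_text}\nLabels:")
    return "\n\n".join(lines)

def rectify(query_text, query_labels, demos, call_llm):
    predicted = call_llm(build_prompt(demos, query_text, query_labels))
    predicted = {l.strip() for l in predicted.split(",") if l.strip()}
    if predicted != set(query_labels):
        # Model did not simply copy: keep its labels instead of discarding
        # the example (the rectification step).
        return sorted(predicted)
    return list(query_labels)
```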
♻ ☆ MoPD: Mixture-of-Prompts Distillation for Vision-Language Models
Soft prompt learning methods are effective for adapting vision-language
models (VLMs) to downstream tasks. Nevertheless, empirical evidence reveals
that existing methods tend to overfit seen classes and exhibit degraded
performance on unseen classes. This limitation is due to the inherent
bias in the training data towards the seen classes. To address this issue, we
propose a novel soft prompt learning method, named Mixture-of-Prompts
Distillation (MoPD), which effectively transfers useful knowledge from
manually hand-crafted hard prompts (a.k.a. teacher prompts) to the learnable
soft prompt (a.k.a. student prompt), thereby enhancing the generalization ability of
soft prompts on unseen classes. Moreover, the proposed MoPD method utilizes a
gating network that learns to select hard prompts used for prompt distillation.
Extensive experiments demonstrate that the proposed MoPD method outperforms
state-of-the-art baselines, especially on unseen classes.
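The sketch below illustrates, under assumed shapes and losses, how a gating network could weight several hard-prompt (teacher) predictions and distill the mixture into the soft-prompt (student) predictions; it is a schematic reading of the abstract, not the authors' code.

```python
# Schematic gated mixture of teacher-prompt predictions plus a standard KL
# distillation loss toward the student (soft-prompt) predictions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptGate(nn.Module):
    def __init__(self, img_dim: int, n_teachers: int):
        super().__init__()
        self.scorer = nn.Linear(img_dim, n_teachers)

    def forward(self, img_feat, teacher_logits):
        # img_feat: (B, D); teacher_logits: (n_teachers, B, C)
        w = F.softmax(self.scorer(img_feat), dim=-1)            # (B, T)
        return torch.einsum("bt,tbc->bc", w, teacher_logits)    # (B, C)

def distill_loss(student_logits, mixed_teacher_logits, tau: float = 2.0):
    p_t = F.softmax(mixed_teacher_logits / tau, dim=-1)
    log_p_s = F.log_softmax(student_logits / tau, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * tau ** 2

gate = PromptGate(img_dim=512, n_teachers=3)
mixed = gate(torch.randn(8, 512), torch.randn(3, 8, 100))
print(distill_loss(torch.randn(8, 100), mixed))
```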
♻ ☆ SPECS: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation
As interest grows in generating long, detailed image captions, standard
evaluation metrics become increasingly unreliable. N-gram-based metrics,
though efficient, fail to capture semantic correctness. Representational
Similarity (RS) metrics, designed to address this, initially saw limited use
due to high computational costs; today, despite advances in hardware, they
remain unpopular due to their low correlation with human judgments. Meanwhile,
metrics based
on large language models (LLMs) show strong correlation with human judgments,
but remain too expensive for iterative use during model development.
We introduce SPECS (Specificity-Enhanced CLIPScore), a reference-free RS
metric tailored to long image captioning. SPECS modifies CLIP with a new
objective that emphasizes specificity: rewarding correct details and penalizing
incorrect ones. We show that SPECS matches the performance of open-source
LLM-based metrics in correlation to human judgments, while being far more
efficient. This makes it a practical alternative for iterative checkpoint
evaluation during image captioning model development. Our code can be found at
https://github.com/mbzuai-nlp/SPECS.
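For context, the snippet below computes a plain reference-free CLIPScore-style value with the public CLIP weights; SPECS additionally fine-tunes CLIP with its specificity objective, which is not reproduced here.

```python
# Baseline reference-free scoring in the CLIPScore family (not SPECS itself),
# using the public openai/clip-vit-base-patch32 checkpoint.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    inputs = processor(text=[caption], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    # CLIPScore convention: rescaled, clipped cosine similarity.
    return max(0.0, 2.5 * float((img * txt).sum()))

print(clip_score(Image.new("RGB", (224, 224), "white"),
                 "a plain white square"))
```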
♻ ☆ Open-sci-ref-0.01: open and reproducible reference baselines for language model and dataset comparison
Marianna Nezhurina, Jörg Franke, Taishi Nakamura, Timur Carstensen, Niccolò Ajroldi, Ville Komulainen, David Salinas, Jenia Jitsev
We introduce open-sci-ref, a family of dense transformer models trained as
research baselines across multiple model (0.13B to 1.7B parameters) and token
scales (up to 1T) on 8 recent open reference datasets. Evaluating the models on
various standardized benchmarks, our set of training runs establishes
reference points that enable researchers to assess the sanity and quality of
alternative training approaches across scales and datasets. Intermediate
checkpoints allow comparison and study of the training dynamics. The established reference
baselines allow training procedures to be compared through their scaling
trends, aligning them on a common compute axis. Comparison of open reference
datasets reveals that training on NemoTron-CC HQ consistently outperforms other
reference datasets, followed by DCLM-baseline and FineWeb-Edu. In addition to
intermediate training checkpoints, the release includes logs, code, and
downstream evaluations to simplify reproduction, standardize comparison, and
facilitate future research.
comment: Model weights and intermediate checkpoints are available at
https://huggingface.co/collections/open-sci/open-sci-ref-001-685905e598be658fbcebff4f;
code for reproducing training, evaluation and raw experiments data at
https://github.com/LAION-AI/open-sci-ref-0.01
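One simple way to place such runs on a common compute axis, as the abstract describes, is the usual C ~ 6 * N * D approximation (parameters times training tokens). The sketch below applies it to two invented runs; the numbers are placeholders, not released results.

```python
# Placing training runs on a common compute axis with the 6*N*D rule of thumb.
# Run configurations and scores below are made-up placeholders.
def approx_train_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens

runs = [
    {"model": "0.13B / 0.3T tokens", "N": 0.13e9, "D": 0.3e12, "score": 0.41},
    {"model": "1.7B / 1.0T tokens",  "N": 1.7e9,  "D": 1.0e12, "score": 0.58},
]
for r in runs:
    r["flops"] = approx_train_flops(r["N"], r["D"])
    print(f'{r["model"]:24s} ~{r["flops"]:.2e} FLOPs  score={r["score"]}')
```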
♻ ☆ Building Self-Evolving Agents via Experience-Driven Lifelong Learning: A Framework and Benchmark
Yuxuan Cai, Yipeng Hao, Jie Zhou, Hang Yan, Zhikai Lei, Rui Zhen, Zhenhua Han, Yutao Yang, Junsong Li, Qianjun Pan, Tianyu Huai, Qin Chen, Xin Li, Kai Chen, Bo Zhang, Xipeng Qiu, Liang He
As AI advances toward general intelligence, the focus is shifting from
systems optimized for static tasks to open-ended agents that learn
continuously. In this paper, we introduce Experience-driven Lifelong Learning
(ELL), a framework for building self-evolving agents capable of continuous
growth through real-world interaction. The framework is built on four core
principles: (1) Experience Exploration: Agents learn through continuous,
self-motivated interaction with dynamic environments, navigating interdependent
tasks and generating rich experiential trajectories. (2) Long-term Memory:
Agents preserve and structure historical knowledge, including personal
experiences, domain expertise, and commonsense reasoning, into a persistent
memory system. (3) Skill Learning: Agents autonomously improve by abstracting
recurring patterns from experience into reusable skills, which are actively
refined and validated for application in new tasks. (4) Knowledge
Internalization: Agents internalize explicit and discrete experiences into
implicit and intuitive capabilities as "second nature".
We also introduce StuLife, a benchmark dataset for ELL that simulates a
student's holistic college journey, from enrollment to academic and personal
development, across three core phases and ten detailed sub-scenarios. StuLife
is designed around three key paradigm
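(abstract truncated in the source listing)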
♻ ☆ Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation
Generation capabilities and language coverage of multilingual large language
models (mLLMs) are advancing rapidly. However, evaluation practices for the
generative abilities of mLLMs still lack comprehensiveness, scientific rigor,
and consistent adoption across research labs, which undermines their
potential to meaningfully guide mLLM development. We draw parallels with
machine translation (MT) evaluation, a field that faced similar challenges and
has, over decades, developed transparent reporting standards and reliable
evaluations for multilingual generative models. Through targeted experiments
across key stages of the generative evaluation pipeline, we demonstrate how
best practices from MT evaluation can deepen the understanding of quality
differences between models. Additionally, we identify essential components for
robust meta-evaluation of mLLMs, ensuring the evaluation methods themselves are
rigorously assessed. We distill these insights into a checklist of actionable
recommendations for mLLM research and development.
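A standard MT-style meta-evaluation step alluded to above is checking how well an automatic metric's system ranking agrees with human judgments, for example via system-level Kendall's tau; the scores in the sketch are invented placeholders.

```python
# System-level agreement between an automatic metric and human judgments,
# a common MT-evaluation sanity check. Scores are illustrative only.
from scipy.stats import kendalltau

human_scores  = {"sys_a": 78.2, "sys_b": 71.5, "sys_c": 69.9, "sys_d": 64.0}
metric_scores = {"sys_a": 0.81, "sys_b": 0.74, "sys_c": 0.75, "sys_d": 0.62}

systems = sorted(human_scores)
tau, p = kendalltau([human_scores[s] for s in systems],
                    [metric_scores[s] for s in systems])
print(f"system-level Kendall tau = {tau:.2f} (p = {p:.3f})")
```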
♻ ☆ Oyster-I: Beyond Refusal -- Constructive Safety Alignment for Responsible Language Models
Ranjie Duan, Jiexi Liu, Xiaojun Jia, Shiji Zhao, Ruoxi Cheng, Fengxiang Wang, Cheng Wei, Yong Xie, Chang Liu, Defeng Li, Yinpeng Dong, Yichi Zhang, Yuefeng Chen, Chongwen Wang, Xingjun Ma, Xingxing Wei, Yang Liu, Hang Su, Jun Zhu, Xinfeng Li, Yitong Sun, Jie Zhang, Jinzhao Hu, Sha Xu, Yitong Yang, Jialing Tao, Hui Xue
Large language models (LLMs) typically deploy safety mechanisms to prevent
harmful content generation. Most current approaches focus narrowly on risks
posed by malicious actors, often framing risks as adversarial events and
relying on defensive refusals. However, in real-world settings, risks also come
from non-malicious users seeking help while under psychological distress (e.g.,
self-harm intentions). In such cases, the model's response can strongly
influence the user's next actions. Simple refusals may lead them to repeat,
escalate, or move to unsafe platforms, creating worse outcomes. We introduce
Constructive Safety Alignment (CSA), a human-centric paradigm that protects
against malicious misuse while actively guiding vulnerable users toward safe
and helpful results. Implemented in Oyster-I (Oy1), CSA combines game-theoretic
anticipation of user reactions, fine-grained risk boundary discovery, and
interpretable reasoning control, turning safety into a trust-building process.
Oy1 achieves state-of-the-art safety among open models while retaining high
general capabilities. On our Constructive Benchmark, it shows strong
constructive engagement, close to GPT-5, and unmatched robustness on the
Strata-Sword jailbreak dataset, nearing GPT-o1 levels. By shifting from
refusal-first to guidance-first safety, CSA redefines the model-user
relationship, aiming for systems that are not just safe, but meaningfully
helpful. We release Oy1, code, and the benchmark to support responsible,
user-centered AI.
comment: Technical Report Code & Model weights available:
https://github.com/Alibaba-AAIG/Oyster
♻ ☆ Humanity's Last Exam
Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Dmitry Dodonov, Tung Nguyen, Jaeho Lee, Daron Anderson, Mikhail Doroshenko, Alun Cennyth Stokes, Mobeen Mahmood, Oleksandr Pokutnyi, Oleg Iskra, Jessica P. Wang, John-Clark Levin, Mstyslav Kazakov, Fiona Feng, Steven Y. Feng, Haoran Zhao, Michael Yu, Varun Gangal, Chelsea Zou, Zihan Wang, Serguei Popov, Robert Gerbicz, Geoff Galgon, Johannes Schmitt, Will Yeadon, Yongki Lee, Scott Sauers, Alvaro Sanchez, Fabian Giska, Marc Roth, Søren Riis, Saiteja Utpala, Noah Burns, Gashaw M. Goshu, Mohinder Maheshbhai Naiya, Chidozie Agu, Zachary Giboney, Antrell Cheatom, Francesco Fournier-Facio, Sarah-Jane Crowson, Lennart Finke, Zerui Cheng, Jennifer Zampese, Ryan G. Hoerr, Mark Nandor, Hyunwoo Park, Tim Gehrunger, Jiaqi Cai, Ben McCarty, Alexis C Garretson, Edwin Taylor, Damien Sileo, Qiuyu Ren, Usman Qazi, Lianghui Li, Jungbae Nam, John B. Wydallis, Pavel Arkhipov, Jack Wei Lun Shi, Aras Bacho, Chris G. Willcocks, Hangrui Cao, Sumeet Motwani, Emily de Oliveira Santos, Johannes Veith, Edward Vendrow, Doru Cojoc, Kengo Zenitani, Joshua Robinson, Longke Tang, Yuqi Li, Joshua Vendrow, Natanael Wildner Fraga, Vladyslav Kuchkin, Andrey Pupasov Maksimov, Pierre Marion, Denis Efremov, Jayson Lynch, Kaiqu Liang, Aleksandar Mikov, Andrew Gritsevskiy, Julien Guillod, Gözdenur Demir, Dakotah Martinez, Ben Pageler, Kevin Zhou, Saeed Soori, Ori Press, Henry Tang, Paolo Rissone, Sean R. Green, Lina Brüssel, Moon Twayana, Aymeric Dieuleveut, Joseph Marvin Imperial, Ameya Prabhu, Jinzhou Yang, Nick Crispino, Arun Rao, Dimitri Zvonkine, Gabriel Loiseau, Mikhail Kalinin, Marco Lukas, Ciprian Manolescu, Nate Stambaugh, Subrata Mishra, Tad Hogg, Carlo Bosio, Brian P Coppola, Julian Salazar, Jaehyeok Jin, Rafael Sayous, Stefan Ivanov, Philippe Schwaller, Shaipranesh Senthilkuma, Andres M Bran, Andres Algaba, Kelsey Van den Houte, Lynn Van Der Sypt, Brecht Verbeken, David Noever, Alexei Kopylov, Benjamin Myklebust, Bikun Li, Lisa Schut, Evgenii Zheltonozhskii, Qiaochu Yuan, Derek Lim, Richard Stanley, Tong Yang, John Maar, Julian Wykowski, Martí Oller, Anmol Sahu, Cesare Giulio Ardito, Yuzheng Hu, Ariel Ghislain Kemogne Kamdoum, Alvin Jin, Tobias Garcia Vilchis, Yuexuan Zu, Martin Lackner, James Koppel, Gongbo Sun, Daniil S. Antonenko, Steffi Chern, Bingchen Zhao, Pierrot Arsene, Joseph M Cavanagh, Daofeng Li, Jiawei Shen, Donato Crisostomi, Wenjin Zhang, Ali Dehghan, Sergey Ivanov, David Perrella, Nurdin Kaparov, Allen Zang, Ilia Sucholutsky, Arina Kharlamova, Daniil Orel, Vladislav Poritski, Shalev Ben-David, Zachary Berger, Parker Whitfill, Michael Foster, Daniel Munro, Linh Ho, Shankar Sivarajan, Dan Bar Hava, Aleksey Kuchkin, David Holmes, Alexandra Rodriguez-Romero, Frank Sommerhage, Anji Zhang, Richard Moat, Keith Schneider, Zakayo Kazibwe, Don Clarke, Dae Hyun Kim, Felipe Meneguitti Dias, Sara Fish, Veit Elser, Tobias Kreiman, Victor Efren Guadarrama Vilchis, Immo Klose, Ujjwala Anantheswaran, Adam Zweiger, Kaivalya Rawal, Jeffery Li, Jeremy Nguyen, Nicolas Daans, Haline Heidinger, Maksim Radionov, Václav Rozhoň, Vincent Ginis, Christian Stump, Niv Cohen, Rafał Poświata, Josef Tkadlec, Alan Goldfarb, Chenguang Wang, Piotr Padlewski, Stanislaw Barzowski, Kyle Montgomery, Ryan Stendall, Jamie Tucker-Foltz, Jack Stade, T. 
Ryan Rogers, Tom Goertzen, Declan Grabb, Abhishek Shukla, Alan Givré, John Arnold Ambay, Archan Sen, Muhammad Fayez Aziz, Mark H Inlow, Hao He, Ling Zhang, Younesse Kaddar, Ivar Ängquist, Yanxu Chen, Harrison K Wang, Kalyan Ramakrishnan, Elliott Thornley, Antonio Terpin, Hailey Schoelkopf, Eric Zheng, Avishy Carmi, Ethan D. L. Brown, Kelin Zhu, Max Bartolo, Richard Wheeler, Martin Stehberger, Peter Bradshaw, JP Heimonen, Kaustubh Sridhar, Ido Akov, Jennifer Sandlin, Yury Makarychev, Joanna Tam, Hieu Hoang, David M. Cunningham, Vladimir Goryachev, Demosthenes Patramanis, Michael Krause, Andrew Redenti, David Aldous, Jesyin Lai, Shannon Coleman, Jiangnan Xu, Sangwon Lee, Ilias Magoulas, Sandy Zhao, Ning Tang, Michael K. Cohen, Orr Paradise, Jan Hendrik Kirchner, Maksym Ovchynnikov, Jason O. Matos, Adithya Shenoy, Michael Wang, Yuzhou Nie, Anna Sztyber-Betley, Paolo Faraboschi, Robin Riblet, Jonathan Crozier, Shiv Halasyamani, Shreyas Verma, Prashant Joshi, Eli Meril, Ziqiao Ma, Jérémy Andréoletti, Raghav Singhal, Jacob Platnick, Volodymyr Nevirkovets, Luke Basler, Alexander Ivanov, Seri Khoury, Nils Gustafsson, Marco Piccardo, Hamid Mostaghimi, Qijia Chen, Virendra Singh, Tran Quoc Khánh, Paul Rosu, Hannah Szlyk, Zachary Brown, Himanshu Narayan, Aline Menezes, Jonathan Roberts, William Alley, Kunyang Sun, Arkil Patel, Max Lamparth, Anka Reuel, Linwei Xin, Hanmeng Xu, Jacob Loader, Freddie Martin, Zixuan Wang, Andrea Achilleos, Thomas Preu, Tomek Korbak, Ida Bosio, Fereshteh Kazemi, Ziye Chen, Biró Bálint, Eve J. Y. Lo, Jiaqi Wang, Maria Inês S. Nunes, Jeremiah Milbauer, M Saiful Bari, Zihao Wang, Behzad Ansarinejad, Yewen Sun, Stephane Durand, Hossam Elgnainy, Guillaume Douville, Daniel Tordera, George Balabanian, Hew Wolff, Lynna Kvistad, Hsiaoyun Milliron, Ahmad Sakor, Murat Eron, Andrew Favre D. O., Shailesh Shah, Xiaoxiang Zhou, Firuz Kamalov, Sherwin Abdoli, Tim Santens, Shaul Barkan, Allison Tee, Robin Zhang, Alessandro Tomasiello, G. Bruno De Luca, Shi-Zhuo Looi, Vinh-Kha Le, Noam Kolt, Jiayi Pan, Emma Rodman, Jacob Drori, Carl J Fossum, Niklas Muennighoff, Milind Jagota, Ronak Pradeep, Honglu Fan, Jonathan Eicher, Michael Chen, Kushal Thaman, William Merrill, Moritz Firsching, Carter Harris, Stefan Ciobâcă, Jason Gross, Rohan Pandey, Ilya Gusev, Adam Jones, Shashank Agnihotri, Pavel Zhelnov, Mohammadreza Mofayezi, Alexander Piperski, David K. Zhang, Kostiantyn Dobarskyi, Roman Leventov, Ignat Soroko, Joshua Duersch, Vage Taamazyan, Andrew Ho, Wenjie Ma, William Held, Ruicheng Xian, Armel Randy Zebaze, Mohanad Mohamed, Julian Noah Leser, Michelle X Yuan, Laila Yacar, Johannes Lengler, Katarzyna Olszewska, Claudio Di Fratta, Edson Oliveira, Joseph W. 
Jackson, Andy Zou, Muthu Chidambaram, Timothy Manik, Hector Haffenden, Dashiell Stander, Ali Dasouqi, Alexander Shen, Bita Golshani, David Stap, Egor Kretov, Mikalai Uzhou, Alina Borisovna Zhidkovskaya, Nick Winter, Miguel Orbegozo Rodriguez, Robert Lauff, Dustin Wehr, Colin Tang, Zaki Hossain, Shaun Phillips, Fortuna Samuele, Fredrik Ekström, Angela Hammon, Oam Patel, Faraz Farhidi, George Medley, Forough Mohammadzadeh, Madellene Peñaflor, Haile Kassahun, Alena Friedrich, Rayner Hernandez Perez, Daniel Pyda, Taom Sakal, Omkar Dhamane, Ali Khajegili Mirabadi, Eric Hallman, Kenchi Okutsu, Mike Battaglia, Mohammad Maghsoudimehrabani, Alon Amit, Dave Hulbert, Roberto Pereira, Simon Weber, Handoko, Anton Peristyy, Stephen Malina, Mustafa Mehkary, Rami Aly, Frank Reidegeld, Anna-Katharina Dick, Cary Friday, Mukhwinder Singh, Hassan Shapourian, Wanyoung Kim, Mariana Costa, Hubeyb Gurdogan, Harsh Kumar, Chiara Ceconello, Chao Zhuang, Haon Park, Micah Carroll, Andrew R. Tawfeek, Stefan Steinerberger, Daattavya Aggarwal, Michael Kirchhof, Linjie Dai, Evan Kim, Johan Ferret, Jainam Shah, Yuzhou Wang, Minghao Yan, Krzysztof Burdzy, Lixin Zhang, Antonio Franca, Diana T. Pham, Kang Yong Loh, Joshua Robinson, Abram Jackson, Paolo Giordano, Philipp Petersen, Adrian Cosma, Jesus Colino, Colin White, Jacob Votava, Vladimir Vinnikov, Ethan Delaney, Petr Spelda, Vit Stritecky, Syed M. Shahid, Jean-Christophe Mourrat, Lavr Vetoshkin, Koen Sponselee, Renas Bacho, Zheng-Xin Yong, Florencia de la Rosa, Nathan Cho, Xiuyu Li, Guillaume Malod, Orion Weller, Guglielmo Albani, Leon Lang, Julien Laurendeau, Dmitry Kazakov, Fatimah Adesanya, Julien Portier, Lawrence Hollom, Victor Souza, Yuchen Anna Zhou, Julien Degorre, Yiğit Yalın, Gbenga Daniel Obikoya, Rai, Filippo Bigi, M. C. Boscá, Oleg Shumar, Kaniuar Bacho, Gabriel Recchia, Mara Popescu, Nikita Shulga, Ngefor Mildred Tanwie, Thomas C. H. Lux, Ben Rank, Colin Ni, Matthew Brooks, Alesia Yakimchyk, Huanxu, Liu, Stefano Cavalleri, Olle Häggström, Emil Verkama, Joshua Newbould, Hans Gundlach, Leonor Brito-Santana, Brian Amaro, Vivek Vajipey, Rynaa Grover, Ting Wang, Yosi Kratish, Wen-Ding Li, Sivakanth Gopi, Andrea Caciolai, Christian Schroeder de Witt, Pablo Hernández-Cámara, Emanuele Rodolà, Jules Robins, Dominic Williamson, Vincent Cheng, Brad Raynor, Hao Qi, Ben Segev, Jingxuan Fan, Sarah Martinson, Erik Y. Wang, Kaylie Hausknecht, Michael P. Brenner, Mao Mao, Christoph Demian, Peyman Kassani, Xinyu Zhang, David Avagian, Eshawn Jessica Scipio, Alon Ragoler, Justin Tan, Blake Sims, Rebeka Plecnik, Aaron Kirtland, Omer Faruk Bodur, D. P. Shinde, Yan Carlos Leyva Labrador, Zahra Adoul, Mohamed Zekry, Ali Karakoc, Tania C. B. Santos, Samir Shamseldeen, Loukmane Karim, Anna Liakhovitskaia, Nate Resman, Nicholas Farina, Juan Carlos Gonzalez, Gabe Maayan, Earth Anderson, Rodrigo De Oliveira Pena, Elizabeth Kelley, Hodjat Mariji, Rasoul Pouriamanesh, Wentao Wu, Ross Finocchio, Ismail Alarab, Joshua Cole, Danyelle Ferreira, Bryan Johnson, Mohammad Safdari, Liangti Dai, Siriphan Arthornthurasuk, Isaac C. McAlister, Alejandro José Moyano, Alexey Pronin, Jing Fan, Angel Ramirez-Trinidad, Yana Malysheva, Daphiny Pottmaier, Omid Taheri, Stanley Stepanic, Samuel Perry, Luke Askew, Raúl Adrián Huerta Rodríguez, Ali M. R. Minissi, Ricardo Lorena, Krishnamurthy Iyer, Arshad Anil Fasiludeen, Ronald Clark, Josh Ducey, Matheus Piza, Maja Somrak, Eric Vergo, Juehang Qin, Benjámin Borbás, Eric Chu, Jack Lindsey, Antoine Jallon, I. M. J. 
McInnis, Evan Chen, Avi Semler, Luk Gloor, Tej Shah, Marc Carauleanu, Pascal Lauer, Tran Đuc Huy, Hossein Shahrtash, Emilien Duc, Lukas Lewark, Assaf Brown, Samuel Albanie, Brian Weber, Warren S. Vaz, Pierre Clavier, Yiyang Fan, Gabriel Poesia Reis e Silva, Long, Lian, Marcus Abramovitch, Xi Jiang, Sandra Mendoza, Murat Islam, Juan Gonzalez, Vasilios Mavroudis, Justin Xu, Pawan Kumar, Laxman Prasad Goswami, Daniel Bugas, Nasser Heydari, Ferenc Jeanplong, Thorben Jansen, Antonella Pinto, Archimedes Apronti, Abdallah Galal, Ng Ze-An, Ankit Singh, Tong Jiang, Joan of Arc Xavier, Kanu Priya Agarwal, Mohammed Berkani, Gang Zhang, Zhehang Du, Benedito Alves de Oliveira Junior, Dmitry Malishev, Nicolas Remy, Taylor D. Hartman, Tim Tarver, Stephen Mensah, Gautier Abou Loume, Wiktor Morak, Farzad Habibi, Sarah Hoback, Will Cai, Javier Gimenez, Roselynn Grace Montecillo, Jakub Łucki, Russell Campbell, Asankhaya Sharma, Khalida Meer, Shreen Gul, Daniel Espinosa Gonzalez, Xavier Alapont, Alex Hoover, Gunjan Chhablani, Freddie Vargus, Arunim Agarwal, Yibo Jiang, Deepakkumar Patil, David Outevsky, Kevin Joseph Scaria, Rajat Maheshwari, Abdelkader Dendane, Priti Shukla, Ashley Cartwright, Sergei Bogdanov, Niels Mündler, Sören Möller, Luca Arnaboldi, Kunvar Thaman, Muhammad Rehan Siddiqi, Prajvi Saxena, Himanshu Gupta, Tony Fruhauff, Glen Sherman, Mátyás Vincze, Siranut Usawasutsakorn, Dylan Ler, Anil Radhakrishnan, Innocent Enyekwe, Sk Md Salauddin, Jiang Muzhen, Aleksandr Maksapetyan, Vivien Rossbach, Chris Harjadi, Mohsen Bahaloohoreh, Claire Sparrow, Jasdeep Sidhu, Sam Ali, Song Bian, John Lai, Eric Singer, Justine Leon Uro, Greg Bateman, Mohamed Sayed, Ahmed Menshawy, Darling Duclosel, Dario Bezzi, Yashaswini Jain, Ashley Aaron, Murat Tiryakioglu, Sheeshram Siddh, Keith Krenek, Imad Ali Shah, Jun Jin, Scott Creighton, Denis Peskoff, Zienab EL-Wasif, Ragavendran P V, Michael Richmond, Joseph McGowan, Tejal Patwardhan, Hao-Yu Sun, Ting Sun, Nikola Zubić, Samuele Sala, Stephen Ebert, Jean Kaddour, Manuel Schottdorf, Dianzhuo Wang, Gerol Petruzella, Alex Meiburg, Tilen Medved, Ali ElSheikh, S Ashwin Hebbar, Lorenzo Vaquero, Xianjun Yang, Jason Poulos, Vilém Zouhar, Sergey Bogdanik, Mingfang Zhang, Jorge Sanz-Ros, David Anugraha, Yinwei Dai, Anh N. Nhu, Xue Wang, Ali Anil Demircali, Zhibai Jia, Yuyin Zhou, Juncheng Wu, Mike He, Nitin Chandok, Aarush Sinha, Gaoxiang Luo, Long Le, Mickaël Noyé, Michał Perełkiewicz, Ioannis Pantidis, Tianbo Qi, Soham Sachin Purohit, Letitia Parcalabescu, Thai-Hoa Nguyen, Genta Indra Winata, Edoardo M. Ponti, Hanchen Li, Kaustubh Dhole, Jongee Park, Dario Abbondanza, Yuanli Wang, Anupam Nayak, Diogo M. Caetano, Antonio A. W. L. Wong, Maria del Rio-Chanona, Dániel Kondor, Pieter Francois, Ed Chalstrey, Jakob Zsambok, Dan Hoyer, Jenny Reddish, Jakob Hauser, Francisco-Javier Rodrigo-Ginés, Suchandra Datta, Maxwell Shepherd, Thom Kamphuis, Qizheng Zhang, Hyunjun Kim, Ruiji Sun, Jianzhu Yao, Franck Dernoncourt, Satyapriya Krishna, Sina Rismanchian, Bonan Pu, Francesco Pinto, Yingheng Wang, Kumar Shridhar, Kalon J. Overholt, Glib Briia, Hieu Nguyen, David, Soler Bartomeu, Tony CY Pang, Adam Wecker, Yifan Xiong, Fanfei Li, Lukas S. Huber, Joshua Jaeger, Romano De Maddalena, Xing Han Lù, Yuhui Zhang, Claas Beger, Patrick Tser Jern Kon, Sean Li, Vivek Sanker, Ming Yin, Yihao Liang, Xinlu Zhang, Ankit Agrawal, Li S. 
Yifei, Zechen Zhang, Mu Cai, Yasin Sonmez, Costin Cozianu, Changhao Li, Alex Slen, Shoubin Yu, Hyun Kyu Park, Gabriele Sarti, Marcin Briański, Alessandro Stolfo, Truong An Nguyen, Mike Zhang, Yotam Perlitz, Jose Hernandez-Orallo, Runjia Li, Amin Shabani, Felix Juefei-Xu, Shikhar Dhingra, Orr Zohar, My Chiffon Nguyen, Alexander Pondaven, Abdurrahim Yilmaz, Xuandong Zhao, Chuanyang Jin, Muyan Jiang, Stefan Todoran, Xinyao Han, Jules Kreuer, Brian Rabern, Anna Plassart, Martino Maggetti, Luther Yap, Robert Geirhos, Jonathon Kean, Dingsu Wang, Sina Mollaei, Chenkai Sun, Yifan Yin, Shiqi Wang, Rui Li, Yaowen Chang, Anjiang Wei, Alice Bizeul, Xiaohan Wang, Alexandre Oliveira Arrais, Kushin Mukherjee, Jorge Chamorro-Padial, Jiachen Liu, Xingyu Qu, Junyi Guan, Adam Bouyamourn, Shuyu Wu, Martyna Plomecka, Junda Chen, Mengze Tang, Jiaqi Deng, Shreyas Subramanian, Haocheng Xi, Haoxuan Chen, Weizhi Zhang, Yinuo Ren, Haoqin Tu, Sejong Kim, Yushun Chen, Sara Vera Marjanović, Junwoo Ha, Grzegorz Luczyna, Jeff J. Ma, Zewen Shen, Dawn Song, Cedegao E. Zhang, Zhun Wang, Gaël Gendron, Yunze Xiao, Leo Smucker, Erica Weng, Kwok Hao Lee, Zhe Ye, Stefano Ermon, Ignacio D. Lopez-Miguel, Theo Knights, Anthony Gitter, Namkyu Park, Boyi Wei, Hongzheng Chen, Kunal Pai, Ahmed Elkhanany, Han Lin, Philipp D. Siedler, Jichao Fang, Ritwik Mishra, Károly Zsolnai-Fehér, Xilin Jiang, Shadab Khan, Jun Yuan, Rishab Kumar Jain, Xi Lin, Mike Peterson, Zhe Wang, Aditya Malusare, Maosen Tang, Isha Gupta, Ivan Fosin, Timothy Kang, Barbara Dworakowska, Kazuki Matsumoto, Guangyao Zheng, Gerben Sewuster, Jorge Pretel Villanueva, Ivan Rannev, Igor Chernyavsky, Jiale Chen, Deepayan Banik, Ben Racz, Wenchao Dong, Jianxin Wang, Laila Bashmal, Duarte V. Gonçalves, Wei Hu, Kaushik Bar, Ondrej Bohdal, Atharv Singh Patlan, Shehzaad Dhuliawala, Caroline Geirhos, Julien Wist, Yuval Kansal, Bingsen Chen, Kutay Tire, Atak Talay Yücel, Brandon Christof, Veerupaksh Singla, Zijian Song, Sanxing Chen, Jiaxin Ge, Kaustubh Ponkshe, Isaac Park, Tianneng Shi, Martin Q. Ma, Joshua Mak, Sherwin Lai, Antoine Moulin, Zhuo Cheng, Zhanda Zhu, Ziyi Zhang, Vaidehi Patil, Ketan Jha, Qiutong Men, Jiaxuan Wu, Tianchi Zhang, Bruno Hebling Vieira, Alham Fikri Aji, Jae-Won Chung, Mohammed Mahfoud, Ha Thi Hoang, Marc Sperzel, Wei Hao, Kristof Meding, Sihan Xu, Vassilis Kostakos, Davide Manini, Yueying Liu, Christopher Toukmaji, Jay Paek, Eunmi Yu, Arif Engin Demircali, Zhiyi Sun, Ivan Dewerpe, Hongsen Qin, Roman Pflugfelder, James Bailey, Johnathan Morris, Ville Heilala, Sybille Rosset, Zishun Yu, Peter E. Chen, Woongyeong Yeo, Eeshaan Jain, Ryan Yang, Sreekar Chigurupati, Julia Chernyavsky, Sai Prajwal Reddy, Subhashini Venugopalan, Hunar Batra, Core Francisco Park, Hieu Tran, Guilherme Maximiano, Genghan Zhang, Yizhuo Liang, Hu Shiyu, Rongwu Xu, Rui Pan, Siddharth Suresh, Ziqi Liu, Samaksh Gulati, Songyang Zhang, Peter Turchin, Christopher W. Bartlett, Christopher R. Scotese, Phuong M. 
Cao, Aakaash Nattanmai, Gordon McKellips, Anish Cheraku, Asim Suhail, Ethan Luo, Marvin Deng, Jason Luo, Ashley Zhang, Kavin Jindel, Jay Paek, Kasper Halevy, Allen Baranov, Michael Liu, Advaith Avadhanam, David Zhang, Vincent Cheng, Brad Ma, Evan Fu, Liam Do, Joshua Lass, Hubert Yang, Surya Sunkari, Vishruth Bharath, Violet Ai, James Leung, Rishit Agrawal, Alan Zhou, Kevin Chen, Tejas Kalpathi, Ziqi Xu, Gavin Wang, Tyler Xiao, Erik Maung, Sam Lee, Ryan Yang, Roy Yue, Ben Zhao, Julia Yoon, Sunny Sun, Aryan Singh, Ethan Luo, Clark Peng, Tyler Osbey, Taozhi Wang, Daryl Echeazu, Hubert Yang, Timothy Wu, Spandan Patel, Vidhi Kulkarni, Vijaykaarti Sundarapandiyan, Ashley Zhang, Andrew Le, Zafir Nasim, Srikar Yalam, Ritesh Kasamsetty, Soham Samal, Hubert Yang, David Sun, Nihar Shah, Abhijeet Saha, Alex Zhang, Leon Nguyen, Laasya Nagumalli, Kaixin Wang, Alan Zhou, Aidan Wu, Jason Luo, Anwith Telluri, Summer Yue, Alexandr Wang, Dan Hendrycks
Benchmarks are important tools for tracking the rapid advancements in large
language model (LLM) capabilities. However, benchmarks are not keeping pace in
difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like
MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In
response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at
the frontier of human knowledge, designed to be the final closed-ended academic
benchmark of its kind with broad subject coverage. HLE consists of 2,500
questions across dozens of subjects, including mathematics, humanities, and the
natural sciences. HLE is developed globally by subject-matter experts and
consists of multiple-choice and short-answer questions suitable for automated
grading. Each question has a known solution that is unambiguous and easily
verifiable, but cannot be quickly answered via internet retrieval.
State-of-the-art LLMs demonstrate low accuracy and calibration on HLE,
highlighting a significant gap between current LLM capabilities and the expert
human frontier on closed-ended academic questions. To inform research and
policymaking upon a clear understanding of model capabilities, we publicly
release HLE at https://lastexam.ai.
comment: 29 pages, 6 figures
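The abstract reports both accuracy and calibration; a minimal sketch of computing accuracy and a bucketed expected calibration error over model-reported confidences is shown below, with dummy predictions standing in for graded HLE answers.

```python
# Accuracy and a simple expected calibration error (ECE) over model-reported
# confidences. The confidence and correctness arrays are dummy placeholders.
import numpy as np

def accuracy(correct: np.ndarray) -> float:
    return float(correct.mean())

def expected_calibration_error(conf, correct, n_bins: int = 10) -> float:
    ece, edges = 0.0, np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            # Bin weight times |mean accuracy - mean confidence| in the bin.
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(ece)

conf = np.array([0.9, 0.8, 0.95, 0.6, 0.7])       # stated confidence
correct = np.array([0, 1, 0, 1, 0], dtype=float)  # graded answers
print(accuracy(correct), expected_calibration_error(conf, correct))
```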
♻ ☆ Agentic Vehicles for Human-Centered Mobility Systems
Autonomy, from the Greek autos (self) and nomos (law), refers to the capacity
to operate according to internal rules without external control. Autonomous
vehicles (AuVs) are therefore understood as systems that perceive their
environment and execute pre-programmed tasks independently of external input,
consistent with the SAE levels of automated driving. Yet recent research and
real-world deployments have begun to showcase vehicles that exhibit behaviors
outside the scope of this definition. These include natural language
interaction with humans, goal adaptation, contextual reasoning, external tool
use, and the handling of unforeseen ethical dilemmas, enabled in part by
multimodal large language models (LLMs). These developments highlight not only
a gap between technical autonomy and the broader cognitive and social
capacities required for human-centered mobility, but also the emergence of a
form of vehicle intelligence that currently lacks a clear designation. To
address this gap, the paper introduces the concept of agentic vehicles (AgVs):
vehicles that integrate agentic AI systems to reason, adapt, and interact
within complex environments. It synthesizes recent advances in agentic systems
and suggests how AgVs can complement and even reshape conventional autonomy to
ensure mobility services are aligned with user and societal needs. The paper
concludes by outlining key challenges in the development and governance of AgVs
and their potential role in shaping future agentic transportation systems.
♻ ☆ Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments
Junjie Ye, Changhao Jiang, Zhengyin Du, Yufei Xu, Xuesong Yao, Zhiheng Xi, Xiaoran Fan, Qi Zhang, Tao Gui, Xuanjing Huang, Jiecao Chen
Effective tool use is essential for large language models (LLMs) to interact
meaningfully with their environment. However, progress is limited by the lack
of efficient reinforcement learning (RL) frameworks specifically designed for
tool use, due to challenges in constructing stable training environments and
designing verifiable reward mechanisms. To address this, we propose an
automated environment construction pipeline, incorporating scenario
decomposition, document generation, function integration, complexity scaling,
and localized deployment. This enables the creation of high-quality training
environments that provide detailed and measurable feedback without relying on
external tools. Additionally, we introduce a verifiable reward mechanism that
evaluates both the precision of tool use and the completeness of task
execution. When combined with trajectory data collected from the constructed
environments, this mechanism integrates seamlessly with standard RL algorithms
to facilitate feedback-driven model training. Experiments on LLMs of varying
scales demonstrate that our approach significantly enhances the models'
tool-use performance without degrading their general capabilities, regardless
of inference modes or training algorithms. Our analysis suggests that these
gains result from improved context understanding and reasoning, driven by
updates to the lower-layer MLP parameters in models.
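A hypothetical sketch of a verifiable reward in the spirit described above combines one term for the precision of the issued tool calls with one term for task completeness; the weighting and call representation are assumptions, not the paper's exact reward.

```python
# Illustrative verifiable reward: tool-call precision plus task completeness,
# combined with assumed weights.
def tool_use_reward(issued_calls, reference_calls, completed_subgoals,
                    total_subgoals, w_prec=0.5, w_comp=0.5):
    issued, reference = set(issued_calls), set(reference_calls)
    precision = len(issued & reference) / len(issued) if issued else 0.0
    completeness = completed_subgoals / total_subgoals if total_subgoals else 0.0
    return w_prec * precision + w_comp * completeness

r = tool_use_reward(
    issued_calls=[("search", "weather berlin"), ("convert", "C to F")],
    reference_calls=[("search", "weather berlin")],
    completed_subgoals=1, total_subgoals=2)
print(f"reward = {r:.2f}")  # 0.5 * 0.5 + 0.5 * 0.5 = 0.50
```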
♻ ☆ FinMTEB: Finance Massive Text Embedding Benchmark EMNLP 2025
Embedding models play a crucial role in representing and retrieving
information across various NLP applications. Recent advances in large language
models (LLMs) have further enhanced the performance of embedding models. While
these models are often benchmarked on general-purpose datasets, real-world
applications demand domain-specific evaluation. In this work, we introduce the
Finance Massive Text Embedding Benchmark (FinMTEB), a specialized counterpart
to MTEB designed for the financial domain. FinMTEB comprises 64 financial
domain-specific embedding datasets across 7 tasks that cover diverse textual
types in both Chinese and English, such as financial news articles, corporate
annual reports, ESG reports, regulatory filings, and earnings call transcripts.
We also develop a finance-adapted model, Fin-E5, using a persona-based data
synthesis method to cover diverse financial embedding tasks for training.
Through extensive evaluation of 15 embedding models, including Fin-E5, we show
three key findings: (1) performance on general-purpose benchmarks shows limited
correlation with financial domain tasks; (2) domain-adapted models consistently
outperform their general-purpose counterparts; and (3) surprisingly, a simple
Bag-of-Words (BoW) approach outperforms sophisticated dense embeddings in
financial Semantic Textual Similarity (STS) tasks, underscoring current
limitations in dense embedding techniques. Our work establishes a robust
evaluation framework for financial NLP applications and provides crucial
insights for developing domain-specific embedding models.
comment: EMNLP 2025, https://github.com/yixuantt/FinMTEB
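Finding (3) refers to a Bag-of-Words STS baseline; the sketch below shows such a baseline (count-vector cosine similarity correlated with gold ratings) on invented sentence pairs, purely to illustrate the comparison, not to reproduce FinMTEB numbers.

```python
# Bag-of-Words STS baseline: cosine similarity of count vectors, correlated
# with gold similarity ratings. Sentence pairs and ratings are placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.stats import spearmanr

pairs = [
    ("Q3 revenue rose 12% year over year.",
     "Third-quarter revenue increased 12% from a year earlier."),
    ("Net income fell on higher loan-loss provisions.",
     "Profits declined due to increased provisioning."),
    ("The firm issued new green bonds.",
     "The CEO resigned after the audit."),
]
gold = [4.6, 4.1, 0.8]  # hypothetical human similarity ratings

vec = CountVectorizer().fit([s for pair in pairs for s in pair])
pred = [float(cosine_similarity(vec.transform([a]), vec.transform([b]))[0, 0])
        for a, b in pairs]
print(spearmanr(pred, gold))
```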
♻ ☆ DiFlow-TTS: Discrete Flow Matching with Factorized Speech Tokens for Low-Latency Zero-Shot Text-To-Speech
Zero-shot Text-to-Speech (TTS) aims to synthesize high-quality speech that
mimics the voice of an unseen speaker using only a short reference sample,
requiring not only speaker adaptation but also accurate modeling of prosodic
attributes. Recent approaches based on language models, diffusion, and flow
matching have shown promising results in zero-shot TTS, but still suffer from
slow inference and repetition artifacts. Discrete codec representations have
been widely adopted for speech synthesis, and recent works have begun to
explore diffusion models in purely discrete settings, suggesting the potential
of discrete generative modeling for speech synthesis. However, existing
flow-matching methods typically embed these discrete tokens into a continuous
space and apply continuous flow matching, which may not fully leverage the
advantages of discrete representations. To address these challenges, we
introduce DiFlow-TTS, which, to the best of our knowledge, is the first model
to explore purely Discrete Flow Matching for speech synthesis. DiFlow-TTS
explicitly models factorized speech attributes within a compact and unified
architecture. It leverages in-context learning by conditioning on textual
content, along with prosodic and acoustic attributes extracted from a reference
speech, enabling effective attribute cloning in a zero-shot setting. In
addition, the model employs a factorized flow prediction mechanism with
distinct heads for prosody and acoustic details, allowing it to learn
aspect-specific distributions. Experimental results demonstrate that DiFlow-TTS
achieves promising performance in several key metrics, including naturalness,
prosody, preservation of speaker style, and energy control. It also maintains a
compact model size and achieves low-latency inference, generating speech up to
25.8 times faster than the latest existing baselines.
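As a schematic reading of the factorized prediction mechanism, the sketch below uses a shared backbone with separate heads over distinct prosody and acoustic codebooks; the architecture sizes are assumptions, and the discrete flow-matching objective itself is not reproduced.

```python
# Schematic factorized prediction heads: one head per discrete codebook
# (prosody vs. acoustic) on top of a shared backbone. Sizes are assumptions.
import torch
import torch.nn as nn

class FactorizedHeads(nn.Module):
    def __init__(self, dim=512, prosody_vocab=1024, acoustic_vocab=4096):
        super().__init__()
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2)
        self.prosody_head = nn.Linear(dim, prosody_vocab)
        self.acoustic_head = nn.Linear(dim, acoustic_vocab)

    def forward(self, x):
        h = self.backbone(x)  # (B, T, dim)
        return self.prosody_head(h), self.acoustic_head(h)

p_logits, a_logits = FactorizedHeads()(torch.randn(2, 50, 512))
print(p_logits.shape, a_logits.shape)  # (2, 50, 1024) (2, 50, 4096)
```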
♻ ☆ Atomic Fact Decomposition Helps Attributed Question Answering
Attributed Question Answering (AQA) aims to provide both a trustworthy answer
and a reliable attribution report for a given question. Retrieval is a widely
adopted approach, including two general paradigms: Retrieval-Then-Read (RTR)
and post-hoc retrieval. Recently, Large Language Models (LLMs) have shown
remarkable proficiency, prompting growing interest in AQA among researchers.
However, RTR-based AQA often suffers from irrelevant knowledge and rapidly
changing information, even when LLMs are adopted, while post-hoc
retrieval-based AQA struggles with comprehending long-form answers with complex
logic, and precisely identifying the content needing revision and preserving
the original intent. To tackle these problems, this paper proposes an Atomic
fact decomposition-based Retrieval and Editing (ARE) framework, which
decomposes the generated long-form answers into molecular clauses and atomic
facts using instruction-tuned LLMs. Notably, the instruction-tuned LLMs are
fine-tuned on a well-constructed dataset generated from large-scale
Knowledge Graphs (KGs). This process involves extracting one-hop neighbors from
a given set of entities and transforming the result into coherent long-form
text. Subsequently, ARE leverages a search engine to retrieve evidence related
to the atomic facts, feeding this evidence into an LLM-based verifier to
determine whether the facts require expansion for re-retrieval or editing.
Furthermore, the edited facts are backtracked into the original answer, with
evidence aggregated based on the relationship between molecular clauses and
atomic facts. Extensive evaluations demonstrate the superior performance of our
proposed method over state-of-the-art approaches on several datasets, with an
additionally proposed new metric $Attr_{p}$ for evaluating the precision of
evidence attribution.
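A schematic sketch of the retrieve-verify-edit loop described above follows; all helpers (decompose, search, verify_llm, edit_llm) are placeholders to be supplied, and the control flow is an illustrative reading of the abstract rather than the authors' pipeline.

```python
# Schematic retrieve-verify-edit loop: decompose the answer, retrieve
# evidence per atomic fact, verify, and either re-retrieve or edit; edited
# facts are written back into the answer. All helpers are placeholders.
def refine_answer(answer: str, decompose, search, verify_llm, edit_llm,
                  max_rounds: int = 2) -> str:
    for clause, facts in decompose(answer):       # molecular clauses -> atomic facts
        for fact in facts:
            evidence, verdict = search(fact), None
            for _ in range(max_rounds):
                verdict = verify_llm(fact, evidence)   # "supported" | "expand" | "edit"
                if verdict != "expand":
                    break
                evidence = search(fact, expand=True)   # re-retrieve with a broader query
            if verdict == "edit":
                new_fact = edit_llm(fact, evidence)
                answer = answer.replace(fact, new_fact)  # backtrack the edit
    return answer
```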
♻ ☆ Faster and Better LLMs via Latency-Aware Test-Time Scaling
Test-Time Scaling (TTS) has proven effective in improving the performance of
Large Language Models (LLMs) during inference. However, existing research has
overlooked the efficiency of TTS from a latency-sensitive perspective. Through
a latency-aware evaluation of representative TTS methods, we demonstrate that a
compute-optimal TTS does not always result in the lowest latency in scenarios
where latency is critical. To address this gap and achieve latency-optimal TTS,
we propose two key approaches by optimizing the concurrency configurations: (1)
branch-wise parallelism, which leverages multiple concurrent inference
branches, and (2) sequence-wise parallelism, enabled by speculative decoding.
By integrating these two approaches and allocating computational resources
properly to each, our latency-optimal TTS enables a 32B model to reach 82.3%
accuracy on MATH-500 within 1 minute and a smaller 3B model to achieve 72.4%
within 10 seconds. Our work emphasizes the importance of latency-aware TTS and
demonstrates its ability to deliver both speed and accuracy in
latency-sensitive scenarios.
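A rough latency model for the two knobs discussed above (concurrent reasoning branches and a speculative-decoding speedup per branch) is sketched below; all constants are invented for illustration, not measurements from the paper.

```python
# Rough latency model: K branches run in waves limited by the concurrency
# budget, and speculative decoding speeds up each branch. Constants invented.
def branch_latency(tokens: int, ms_per_token: float, spec_speedup: float) -> float:
    return tokens * ms_per_token / spec_speedup / 1000.0  # seconds

def tts_latency(n_branches: int, concurrency: int, tokens: int = 2048,
                ms_per_token: float = 25.0, spec_speedup: float = 1.8) -> float:
    waves = -(-n_branches // concurrency)  # ceil division
    return waves * branch_latency(tokens, ms_per_token, spec_speedup)

print(f"sequential, 8 branches: {tts_latency(8, 1):6.1f} s")
print(f"concurrent, 8 branches: {tts_latency(8, 8):6.1f} s")
```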