Computation and Language 37
☆ InfAlign: Inference-aware language model alignment
Ananth Balashankar, Ziteng Sun, Jonathan Berant, Jacob Eisenstein, Michael Collins, Adrian Hutter, Jong Lee, Chirag Nagpal, Flavien Prost, Aradhana Sinha, Ananda Theertha Suresh, and Ahmad Beirami
Language model alignment has become a critical step in training modern
generative language models. The goal of alignment is to finetune a reference
model such that the win rate of a sample from the aligned model over a sample
from the reference model is high, subject to a KL divergence constraint. Today,
we are increasingly using inference-time algorithms (e.g., Best-of-N,
controlled decoding, tree search) to decode from language models rather than
standard sampling. However, the alignment objective does not capture such
inference-time decoding procedures. We show that the existing alignment
framework is sub-optimal in view of such inference-time methods. We then modify
the alignment objective and propose a framework for inference-aware alignment
(IAPO). We prove that for any inference-time decoding algorithm, the optimal
solution that optimizes the inference-time win rate of the aligned policy
against the reference policy is the solution to the typical RLHF problem with a
transformation of the reward. This motivates us to provide the KL-regularized
calibrate-and-transform RL (CTRL) algorithm to solve this problem, which
involves a reward calibration step and a KL-regularized reward maximization
step with a transformation of the calibrated reward. We particularize our study
to two important inference-time strategies: best-of-N sampling and best-of-N
jailbreaking, where N responses are sampled from the model and the one with the
highest or lowest reward is selected. We propose specific transformations for
these strategies and demonstrate that our framework offers significant
improvements over existing state-of-the-art methods for language model
alignment. Empirically, we outperform baselines that are designed without
taking inference-time decoding into consideration by 8-12% and 4-9% on
inference-time win rates over the Anthropic helpfulness and harmlessness dialog
benchmark datasets.
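As a concrete illustration of the best-of-N inference-time strategies this abstract refers to, here is a minimal sketch; `generate` and `reward` are hypothetical placeholders for a sampler and a reward model, not the authors' implementation.
```python
# Minimal sketch of best-of-N selection: sample N responses and keep the one
# with the highest reward (best-of-N sampling) or the lowest reward
# (best-of-N jailbreaking). `generate` and `reward` are placeholder callables.
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              reward: Callable[[str, str], float],
              n: int = 8,
              pick_lowest: bool = False) -> str:
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scores = [reward(prompt, c) for c in candidates]
    best_idx = (min if pick_lowest else max)(range(n), key=scores.__getitem__)
    return candidates[best_idx]
```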
☆ Enhancing Whisper's Accuracy and Speed for Indian Languages through Prompt-Tuning and Tokenization ICASSP 2025
Automatic speech recognition has recently seen a significant advancement with
large foundational models such as Whisper. However, these models often struggle
to perform well in low-resource languages, such as Indian languages. This paper
explores two novel approaches to enhance Whisper's multilingual speech
recognition performance in Indian languages. First, we propose prompt-tuning
with language family information, which enhances Whisper's accuracy in
linguistically similar languages. Second, we introduce a novel tokenizer that
reduces the number of generated tokens, thereby accelerating Whisper's
inference speed. Our extensive experiments demonstrate that the tokenizer
significantly reduces inference time, while prompt-tuning enhances accuracy
across various Whisper model sizes, including Small, Medium, and Large.
Together, these techniques achieve a balance between low WER and fast inference.
comment: Accepted at ICASSP 2025, 5 pages, 1 figure, 5 tables
☆ Machine Learning for Sentiment Analysis of Imported Food in Trinidad and Tobago
This research investigates the performance of various machine learning
algorithms (CNN, LSTM, VADER, and RoBERTa) for sentiment analysis of Twitter
data related to imported food items in Trinidad and Tobago. The study addresses
three primary research questions: the comparative accuracy and efficiency of
the algorithms, the optimal configurations for each model, and the potential
applications of the optimized models in a live system for monitoring public
sentiment and its impact on the import bill. The dataset comprises tweets from
2018 to 2024, divided into imbalanced, balanced, and temporal subsets to assess
the impact of data balancing and the COVID-19 pandemic on sentiment trends. Ten
experiments were conducted to evaluate the models under various configurations.
Results indicated that VADER outperformed the other models in both multi-class
and binary sentiment classifications. The study highlights significant changes
in sentiment trends pre- and post-COVID-19, with implications for import
policies.
comment: 27 pages
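For reference, VADER (the best-performing model in the study above) can be applied to tweet text with the vaderSentiment package; a minimal usage sketch, where the 0.05 thresholds follow the package's common convention rather than the paper's exact configuration.
```python
# Minimal VADER sentiment classification sketch using the vaderSentiment
# package. Threshold values are the package's conventional defaults; the
# study's configuration may differ.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def classify(tweet: str) -> str:
    compound = analyzer.polarity_scores(tweet)["compound"]
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

print(classify("Imported food prices keep rising, this is frustrating."))
```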
☆ OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
Qiushi Sun, Kanzhi Cheng, Zichen Ding, Chuanyang Jin, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, Ben Kao, Guohao Li, Junxian He, Yu Qiao, Zhiyong Wu
Graphical User Interface (GUI) agents powered by Vision-Language Models
(VLMs) have demonstrated human-like computer control capability. Despite their
utility in advancing digital automation, a critical bottleneck persists:
collecting high-quality trajectory data for training. Common practices for
collecting such data rely on human supervision or synthetic data generation
through executing pre-defined tasks, which are either resource-intensive or
unable to guarantee data quality. Moreover, these methods suffer from limited
data diversity and significant gaps between synthetic data and real-world
environments. To address these challenges, we propose OS-Genesis, a novel GUI
data synthesis pipeline that reverses the conventional trajectory collection
process. Instead of relying on pre-defined tasks, OS-Genesis enables agents
first to perceive environments and perform step-wise interactions, then
retrospectively derive high-quality tasks to enable trajectory-level
exploration. A trajectory reward model is then employed to ensure the quality
of the generated trajectories. We demonstrate that training GUI agents with
OS-Genesis significantly improves their performance on highly challenging
online benchmarks. In-depth analysis further validates OS-Genesis's efficiency
and its superior data quality and diversity compared to existing synthesis
methods. Our codes, data, and checkpoints are available at
\href{https://qiushisun.github.io/OS-Genesis-Home/}{OS-Genesis Homepage}.
comment: Work in progress
☆ Toward Adaptive Reasoning in Large Language Models with Thought Rollback ICML 2024
Large language models (LLMs) have been routinely used to solve various tasks
using step-by-step reasoning. However, the structure of intermediate reasoning
steps, or thoughts, is rigid and unidirectional, such as chains, trees, or
directed acyclic graphs. Consequently, the resulting inflexible and
forward-only reasoning may not address challenging tasks and may fail when the LLM
frequently gives false responses, i.e., ``hallucinations''. This paper proposes
a new reasoning framework, called Thought Rollback (TR), allowing LLMs to
adaptively build thought structure while maintaining effective reasoning toward
problem-solving under ``hallucinations''. The core mechanism of TR is rolling
back thoughts, which allows LLMs to perform error analysis on thoughts, and
thus roll back to any previously mistaken thought for revision. Subsequently,
by including such trial-and-error in the prompt to guide the LLM, each rollback
leads to one more reliable reasoning path. Therefore, starting with a simple
prompt without human annotations, an LLM with TR adaptively and gradually explores
thoughts for a correct solution. Comprehensive experiments on mathematical
problems and multi-task reasoning demonstrate the state-of-the-art performance
of TR in terms of problem-solving rate and interaction cost. For instance, the
solving rate of GPT-4 with TR outperforms the current best by $9\%$ on the MATH
dataset.
comment: ICML 2024 camera-ready version with 24 pages and 12 figures. Code
repo with all prompts:
https://github.com/iQua/llmpebase/tree/main/examples/ThoughtRollback
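A highly simplified sketch of the rollback idea described above; the actual prompts and error analysis live in the linked repository, and `llm`, `analyze_errors`, and `is_solved` here are hypothetical placeholders.
```python
# Schematic Thought-Rollback-style loop: generate a thought, analyze the
# current path for errors, and roll back to the last sound thought when a
# mistake is detected, recording the trial-and-error for future prompts.
def solve_with_rollback(question, llm, analyze_errors, is_solved, max_steps=20):
    thoughts = []          # current reasoning path
    trial_and_error = []   # rollback records appended to future prompts
    for _ in range(max_steps):
        prompt = "\n".join([question, *trial_and_error, *thoughts])
        thought = llm(prompt)
        bad_index = analyze_errors(question, thoughts + [thought])
        if bad_index is not None:
            # record the mistake, then roll back to the last reliable thought
            trial_and_error.append(f"Avoid: {thoughts[bad_index:] + [thought]}")
            thoughts = thoughts[:bad_index]
            continue
        thoughts.append(thought)
        if is_solved(question, thoughts):
            break
    return thoughts
```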
☆ Machine Generated Product Advertisements: Benchmarking LLMs Against Human Performance
This study compares the performance of AI-generated and human-written product
descriptions using a multifaceted evaluation model. We analyze descriptions for
100 products generated by four AI models (Gemma 2B, LLAMA, GPT2, and ChatGPT 4)
with and without sample descriptions, against human-written descriptions. Our
evaluation metrics include sentiment, readability, persuasiveness, Search
Engine Optimization(SEO), clarity, emotional appeal, and call-to-action
effectiveness. The results indicate that ChatGPT 4 performs the best. In
contrast, other models demonstrate significant shortcomings, producing
incoherent output that lacks logical structure and contextual
relevance. These models struggle to maintain focus on the product being
described, resulting in disjointed sentences that do not convey meaningful
information. This research provides insights into the current capabilities and
limitations of AI in the creation of content for e-Commerce.
☆ A Comparative Study of Machine Unlearning Techniques for Image and Text Classification Models
Omar M. Safa, Mahmoud M. Abdelaziz, Mustafa Eltawy, Mohamed Mamdouh, Moamen Gharib, Salaheldin Eltenihy, Nagia M. Ghanem, Mohamed M. Ismail
Machine Unlearning has emerged as a critical area in artificial intelligence,
addressing the need to selectively remove learned data from machine learning
models in response to data privacy regulations. This paper provides a
comprehensive comparative analysis of six state-of-the-art unlearning techniques
applied to image and text classification tasks. We evaluate their performance,
efficiency, and compliance with regulatory requirements, highlighting their
strengths and limitations in practical scenarios. By systematically analyzing
these methods, we aim to provide insights into their applicability,
challenges, and tradeoffs, fostering advancements in the field of ethical and
adaptable machine learning.
☆ TARGA: Targeted Synthetic Data Generation for Practical Reasoning over Structured Data
Semantic parsing, which converts natural language questions into logic forms,
plays a crucial role in reasoning within structured environments. However,
existing methods encounter two significant challenges: reliance on extensive
manually annotated datasets and limited generalization capability to unseen
examples. To tackle these issues, we propose Targeted Synthetic Data Generation
(TARGA), a practical framework that dynamically generates high-relevance
synthetic data without manual annotation. Starting from the pertinent entities
and relations of a given question, we probe for the potential relevant queries
through layer-wise expansion and cross-layer combination. Then we generate
corresponding natural language questions for these constructed queries to
jointly serve as the synthetic demonstrations for in-context learning.
Experiments on multiple knowledge base question answering (KBQA) datasets
demonstrate that TARGA, using only a 7B-parameter model, substantially
outperforms existing non-fine-tuned methods that utilize closed-source models,
achieving notable improvements in F1 scores on GrailQA (+7.7) and
KBQA-Agent (+12.2). Furthermore, TARGA also exhibits superior sample efficiency,
robustness, and generalization capabilities under non-I.I.D. settings.
☆ Exploiting Domain-Specific Parallel Data on Multilingual Language Models for Low-resource Language Translation
Surangika Ranathungaa, Shravan Nayak, Shih-Ting Cindy Huang, Yanke Mao, Tong Su, Yun-Hsiang Ray Chan, Songchen Yuan, Anthony Rinaldi, Annie En-Shiun Lee
Neural Machine Translation (NMT) systems built on multilingual
sequence-to-sequence Language Models (msLMs) fail to deliver expected results
when both the amount of parallel data for a language and the language's
representation in the model are limited. This restricts the capabilities of
domain-specific NMT systems for low-resource languages (LRLs). As a solution,
parallel data from auxiliary domains can be used either to fine-tune or to
further pre-train the msLM. We present an evaluation of the effectiveness of
these two techniques in the context of domain-specific LRL-NMT. We also explore
the impact of domain divergence on NMT model performance. We recommend several
strategies for utilizing auxiliary parallel data in building domain-specific
NMT models for LRLs.
☆ Confidence v.s. Critique: A Decomposition of Self-Correction Capability for LLMs
Large Language Models (LLMs) can correct their self-generated responses, but
a decline in accuracy after self-correction is also witnessed. To have a deeper
understanding of self-correction, we endeavor to decompose, evaluate, and
analyze the self-correction behaviors of LLMs. By enumerating and analyzing
answer correctness before and after self-correction, we decompose the
self-correction capability into confidence (remaining confident in correct answers)
and critique (turning wrong answers into correct ones) capabilities, and propose two
metrics from a probabilistic perspective to measure these two capabilities, along
with another metric for overall self-correction capability evaluation. Based on
our decomposition and evaluation metrics, we conduct extensive experiments and
draw some empirical conclusions. For example, we find different models can
exhibit distinct behaviors: some models are confident while others are more
critical. We also find a trade-off between the two capabilities (i.e.,
improving one can lead to a decline in the other) when manipulating model
self-correction behavior by prompts or in-context learning. Further, we find a
simple yet efficient strategy to improve self-correction capability by
transforming the Supervised Fine-Tuning (SFT) data format, and our strategy
outperforms vanilla SFT in both capabilities and achieves much higher accuracy
after self-correction. Our code will be publicly available on GitHub.
comment: 16 pages, 10 figures
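A minimal sketch of how confidence- and critique-style metrics could be computed from answer correctness before and after self-correction; these definitions only illustrate the decomposition and are not the paper's exact probabilistic formulation.
```python
# Sketch: decompose self-correction behaviour from (before, after) correctness
# labels. "Confidence" = keeping correct answers correct; "critique" = turning
# wrong answers into correct ones. Illustrative definitions only.
def self_correction_metrics(pairs):
    """pairs: iterable of (correct_before: bool, correct_after: bool)."""
    pairs = list(pairs)
    kept_correct = sum(1 for b, a in pairs if b and a)
    was_correct = sum(1 for b, _ in pairs if b)
    fixed_wrong = sum(1 for b, a in pairs if not b and a)
    was_wrong = sum(1 for b, _ in pairs if not b)
    confidence = kept_correct / was_correct if was_correct else 0.0
    critique = fixed_wrong / was_wrong if was_wrong else 0.0
    return {"confidence": confidence, "critique": critique}

print(self_correction_metrics([(True, True), (True, False), (False, True)]))
```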
☆ Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging
Fine-tuning large language models (LLMs) for downstream tasks is a widely
adopted approach, but it often leads to safety degradation in safety-aligned
LLMs. Currently, many solutions address this issue by incorporating additional
safety data, which can be impractical in many cases. In this paper, we address
the question: How can we improve downstream task performance while preserving
safety in LLMs without relying on additional safety data? We propose a simple
and effective method that maintains the inherent safety of LLMs while enhancing
their downstream task performance: merging the weights of pre- and
post-fine-tuned safety-aligned models. Experimental results across various
downstream tasks, models, and merging methods demonstrate that this approach
effectively mitigates safety degradation while improving downstream task
performance, offering a practical solution for adapting safety-aligned LLMs.
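The merging step described above can be as simple as interpolating the pre- and post-fine-tuning weights; a minimal sketch assuming two state dicts with identical keys, with `alpha` as a hypothetical mixing coefficient (the paper also evaluates other merging methods).
```python
# Sketch: linear interpolation between a safety-aligned model's weights before
# fine-tuning and after fine-tuning. Assumes both state dicts share keys and
# tensor shapes; alpha is an illustrative mixing coefficient.
def merge_state_dicts(pre_sd, post_sd, alpha=0.5):
    """Return alpha * pre + (1 - alpha) * post for every shared tensor."""
    merged = {}
    for name, pre_w in pre_sd.items():
        merged[name] = alpha * pre_w + (1.0 - alpha) * post_sd[name]
    return merged

# usage: model.load_state_dict(merge_state_dicts(pre_model.state_dict(),
#                                                 post_model.state_dict()))
```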
☆ User Willingness-aware Sales Talk Dataset COLING2025
User willingness is a crucial element in the sales talk process that affects
the achievement of the salesperson's or sales system's objectives. Despite the
importance of user willingness, to the best of our knowledge, no previous study
has addressed the development of automated sales talk dialogue systems that
explicitly consider user willingness. A major barrier is the lack of sales talk
datasets with reliable user willingness data. Thus, in this study, we developed
a user willingness-aware sales talk collection by leveraging the ecological
validity concept, which is discussed in the field of human-computer
interaction. Our approach focused on three types of user willingness essential
in real sales interactions. We created a dialogue environment that closely
resembles real-world scenarios to elicit natural user willingness, with
participants evaluating their willingness at the utterance level from multiple
perspectives. We analyzed the collected data to gain insights into practical
user willingness-aware sales talk strategies. In addition, as a practical
application of the constructed dataset, we developed and evaluated a sales
dialogue system aimed at enhancing the user's intent to purchase.
comment: 12 pages, Accepted to COLING2025
☆ Pre-training, Fine-tuning and Re-ranking: A Three-Stage Framework for Legal Question Answering
Legal question answering (QA) has attracted increasing attention from people
seeking legal advice, which aims to retrieve the most applicable answers from a
large-scale database of question-answer pairs. Previous methods mainly use a
dual-encoder architecture to learn dense representations of both questions and
answers. However, these methods could suffer from lacking domain knowledge and
sufficient labeled training data. In this paper, we propose a three-stage
(\underline{p}re-training, \underline{f}ine-tuning and \underline{r}e-ranking)
framework for \underline{l}egal \underline{QA} (called PFR-LQA), which promotes
fine-grained text representation learning and boosts the performance of
dense retrieval with the dual-encoder architecture. Concretely, we first
conduct domain-specific pre-training on legal questions and answers through a
self-supervised training objective, allowing the pre-trained model to be
adapted to the legal domain. Then, we perform task-specific fine-tuning of the
dual-encoder on legal question-answer pairs by using the supervised learning
objective, leading to a high-quality dual-encoder for the specific downstream
QA task. Finally, we employ a contextual re-ranking objective to further refine
the output representations of questions produced by the document encoder, which
uses contextual similarity to increase the discrepancy between the anchor and
hard negative samples for better question re-ranking. We conduct extensive
experiments on a manually annotated legal QA dataset. Experimental results show
that our PFR-LQA method achieves better performance than the strong competitors
for legal question answering.
☆ Feature Alignment-Based Knowledge Distillation for Efficient Compression of Large Language Models
This study proposes a knowledge distillation algorithm based on large
language models and feature alignment, aiming to effectively transfer the
knowledge of large pre-trained models into lightweight student models, thereby
reducing computational costs while maintaining high model performance.
Different from the traditional soft label distillation method, this method
introduces a multi-layer feature alignment strategy to deeply align the
intermediate features and attention mechanisms of the teacher model and the
student model, maximally retaining the semantic expression ability and context
modeling ability of the teacher model. In terms of method design, a multi-task
loss function is constructed, including feature matching loss, attention
alignment loss, and output distribution matching loss, to ensure multi-level
information transfer through joint optimization. The method was
comprehensively evaluated on the GLUE benchmark and various natural language
processing tasks. The results show that the proposed model performs very close
to the state-of-the-art GPT-4 model in terms of evaluation indicators such as
perplexity, BLEU, ROUGE, and CER. At the same time, it far exceeds baseline
models such as DeBERTa, XLNet, and GPT-3, showing significant performance
improvements and computing efficiency advantages. Research results show that
the feature alignment distillation strategy is an effective model compression
method that can significantly reduce computational overhead and storage
requirements while maintaining model capabilities. Future research can be
further expanded in the directions of self-supervised learning, cross-modal
feature alignment, and multi-task transfer learning to provide more flexible
and efficient solutions for the deployment and optimization of deep learning
models.
comment: 4 pages
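A schematic sketch of the multi-task distillation loss the abstract describes (feature matching, attention alignment, and output distribution matching). Layer pairing, loss weights, and the assumption of matching hidden sizes are illustrative choices, not the paper's exact design.
```python
# Sketch of a feature-alignment distillation loss in PyTorch. Assumes paired
# teacher/student layers with matching hidden sizes (otherwise a linear
# projection on the student features would be needed).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_feats, teacher_feats,   # lists of [B, T, d] tensors
                      student_attn, teacher_attn,     # lists of [B, h, T, T] probs
                      T=2.0, w_feat=1.0, w_attn=1.0, w_kl=1.0):
    # 1) intermediate feature matching (MSE over paired layers)
    feat_loss = sum(F.mse_loss(s, t) for s, t in zip(student_feats, teacher_feats))
    # 2) attention alignment (KL between attention distributions)
    attn_loss = sum(F.kl_div(torch.log(s + 1e-9), t, reduction="batchmean")
                    for s, t in zip(student_attn, teacher_attn))
    # 3) output distribution matching (temperature-scaled KL)
    kl_loss = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                       F.softmax(teacher_logits / T, dim=-1),
                       reduction="batchmean") * (T * T)
    return w_feat * feat_loss + w_attn * attn_loss + w_kl * kl_loss
```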
☆ DeepSeek-V3 Technical Report
DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J. L. Cai, Jian Liang, Jianzhong Guo, Jiaqi Ni, Jiashi Li, Jiawei Wang, Jin Chen, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, Junxiao Song, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Lei Xu, Leyi Xia, Liang Zhao, Litong Wang, Liyue Zhang, Meng Li, Miaojun Wang, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingming Li, Ning Tian, Panpan Huang, Peiyi Wang, Peng Zhang, Qiancheng Wang, Qihao Zhu, Qinyu Chen, Qiushi Du, R. J. Chen, R. L. Jin, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, Runxin Xu, Ruoyu Zhang, Ruyi Chen, S. S. Li, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shaoqing Wu, Shengfeng Ye, Shengfeng Ye, Shirong Ma, Shiyu Wang, Shuang Zhou, Shuiping Yu, Shunfeng Zhou, Shuting Pan, T. Wang, Tao Yun, Tian Pei, Tianyu Sun, W. L. Xiao, Wangding Zeng, Wanjia Zhao, Wei An, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, X. Q. Li, Xiangyue Jin, Xianzu Wang, Xiao Bi, Xiaodong Liu, Xiaohan Wang, Xiaojin Shen, Xiaokang Chen, Xiaokang Zhang, Xiaosha Chen, Xiaotao Nie, Xiaowen Sun, Xiaoxiang Wang, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xingkai Yu, Xinnan Song, Xinxia Shan, Xinyi Zhou, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, Y. K. Li, Y. Q. Wang, Y. X. Wei, Y. X. Zhu, Yang Zhang, Yanhong Xu, Yanhong Xu, Yanping Huang, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Li, Yaohui Wang, Yi Yu, Yi Zheng, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Ying Tang, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yu Wu, Yuan Ou, Yuchen Zhu, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yukun Zha, Yunfan Xiong, Yunxian Ma, Yuting Yan, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Z. F. Wu, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhen Huang, Zhen Zhang, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhibin Gou, Zhicheng Ma, Zhigang Yan, Zhihong Shao, Zhipeng Xu, Zhiyu Wu, Zhongyu Zhang, Zhuoshu Li, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Ziyi Gao, Zizheng Pan
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with
671B total parameters with 37B activated for each token. To achieve efficient
inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent
Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated
in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free
strategy for load balancing and sets a multi-token prediction training
objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion
diverse and high-quality tokens, followed by Supervised Fine-Tuning and
Reinforcement Learning stages to fully harness its capabilities. Comprehensive
evaluations reveal that DeepSeek-V3 outperforms other open-source models and
achieves performance comparable to leading closed-source models. Despite its
excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its
full training. In addition, its training process is remarkably stable.
Throughout the entire training process, we did not experience any irrecoverable
loss spikes or perform any rollbacks. The model checkpoints are available at
https://github.com/deepseek-ai/DeepSeek-V3.
♻ ☆ Reasoning over Uncertain Text by Generative Large Language Models
This paper considers the challenges Large Language Models (LLMs) face when
reasoning over text that includes information involving uncertainty explicitly
quantified via probability values. This type of reasoning is relevant to a
variety of contexts ranging from everyday conversations to medical
decision-making. Despite improvements in the mathematical reasoning
capabilities of LLMs, they still exhibit significant difficulties when it comes
to probabilistic reasoning. To deal with this problem, we introduce the
Bayesian Linguistic Inference Dataset (BLInD), a new dataset specifically
designed to test the probabilistic reasoning capabilities of LLMs. We use BLInD
to identify the limitations of LLMs on tasks involving probabilistic
reasoning. In addition, we present several prompting strategies that map the
problem to different formal representations, including Python code,
probabilistic algorithms, and probabilistic logical programming. We conclude by
providing an evaluation of our methods on BLInD and an adaptation of a causal
reasoning question-answering dataset. Our empirical results highlight the
effectiveness of our proposed strategies for multiple LLMs.
♻ ☆ CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification
Deploying large language models (LLMs) on edge devices presents significant
challenges due to the substantial computational overhead and memory
requirements. Activation sparsification can mitigate these resource challenges
by reducing the number of activated neurons during inference. Existing methods
typically employ thresholding-based sparsification based on the statistics of
activation tensors. However, they do not model the impact of activation
sparsification on performance, resulting in unnecessary performance degradation.
To address these limitations, this paper reformulates the activation
sparsification problem to explicitly capture the relationship between
activation sparsity and model performance. Then, this paper proposes CHESS, a
general activation sparsification approach via CHannel-wise thrEsholding and
Selective Sparsification. First, channel-wise thresholding assigns a unique
threshold to each activation channel in the feed-forward network (FFN) layers.
Then, selective sparsification involves applying thresholding-based activation
sparsification to specific layers within the attention modules. Finally, we
detail the implementation of sparse kernels to accelerate LLM inference.
Experimental results demonstrate that the proposed CHESS achieves lower
performance degradation over eight downstream tasks while activating fewer
parameters than existing methods, thus speeding up the LLM inference by up to
1.27x.
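A minimal sketch of the channel-wise thresholding idea: one threshold per activation channel, zeroing activations whose magnitude falls below it. The percentile-based calibration here is a simple stand-in, not the performance-aware procedure CHESS derives.
```python
# Sketch of channel-wise activation thresholding. Thresholds are calibrated
# per channel from sample activations with a simple percentile heuristic;
# CHESS instead derives them from a performance-aware objective.
import torch

def calibrate_thresholds(sample_acts: torch.Tensor, sparsity: float = 0.5):
    """sample_acts: [N, C] activations; returns one threshold per channel."""
    return torch.quantile(sample_acts.abs(), q=sparsity, dim=0)

def sparsify(acts: torch.Tensor, thresholds: torch.Tensor):
    """Zero out entries whose magnitude is below their channel's threshold."""
    mask = acts.abs() >= thresholds  # broadcasts over the channel dimension
    return acts * mask
```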
♻ ☆ Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? NeurIPS 2024
Richard Ren, Steven Basart, Adam Khoja, Alice Gatti, Long Phan, Xuwang Yin, Mantas Mazeika, Alexander Pan, Gabriel Mukobi, Ryan H. Kim, Stephen Fitz, Dan Hendrycks
As artificial intelligence systems grow more powerful, there has been
increasing interest in "AI safety" research to address emerging and future
risks. However, the field of AI safety remains poorly defined and
inconsistently measured, leading to confusion about how researchers can
contribute. This lack of clarity is compounded by the unclear relationship
between AI safety benchmarks and upstream general capabilities (e.g., general
knowledge and reasoning). To address these issues, we conduct a comprehensive
meta-analysis of AI safety benchmarks, empirically analyzing their correlation
with general capabilities across dozens of models and providing a survey of
existing directions in AI safety. Our findings reveal that many safety
benchmarks highly correlate with both upstream model capabilities and training
compute, potentially enabling "safetywashing"--where capability improvements
are misrepresented as safety advancements. Based on these findings, we propose
an empirical foundation for developing more meaningful safety metrics and
define AI safety in a machine learning research context as a set of clearly
delineated research goals that are empirically separable from generic
capabilities advancements. In doing so, we aim to provide a more rigorous
framework for AI safety research, advancing the science of safety evaluations
and clarifying the path towards measurable progress.
comment: NeurIPS 2024
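The core meta-analysis step, checking how strongly a safety benchmark correlates with general capabilities across models, can be sketched as follows; the scores below are placeholder numbers, not the paper's data.
```python
# Sketch: correlate per-model safety-benchmark scores with a general
# capabilities score across models. A high correlation is the signature of
# potential "safetywashing". Placeholder numbers, not the paper's results.
import numpy as np

capability_scores = np.array([55.0, 62.5, 70.1, 78.3, 84.0])   # e.g., a capabilities benchmark
safety_scores     = np.array([41.2, 48.7, 57.9, 66.4, 73.0])   # a candidate safety benchmark

r = np.corrcoef(capability_scores, safety_scores)[0, 1]
print(f"capability-safety correlation: {r:.2f}")
```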
♻ ☆ Context-aware Inductive Knowledge Graph Completion with Latent Type Constraints and Subgraph Reasoning
Inductive knowledge graph completion (KGC) aims to predict missing triples
with unseen entities. Recent works focus on modeling reasoning paths between
the head and tail entity as direct supporting evidence. However, these methods
depend heavily on the existence and quality of reasoning paths, which limits
their general applicability in different scenarios. In addition, we observe
that latent type constraints and neighboring facts inherent in KGs are also
vital in inferring missing triples. To effectively utilize all useful
information in KGs, we introduce CATS, a novel context-aware inductive KGC
solution. With sufficient guidance from proper prompts and supervised
fine-tuning, CATS activates the strong semantic understanding and reasoning
capabilities of large language models to assess the existence of query triples.
CATS consists of two modules. First, the type-aware reasoning module evaluates
whether the candidate entity matches the latent entity type as required by the
query relation. Then, the subgraph reasoning module selects relevant reasoning
paths and neighboring facts, and evaluates their correlation to the query
triple. Experiment results on three widely used datasets demonstrate that CATS
significantly outperforms state-of-the-art methods in 16 out of 18
transductive, inductive, and few-shot settings with an average absolute MRR
improvement of 7.2%.
♻ ☆ Intertwining CP and NLP: The Generation of Unreasonably Constrained Sentences
Constrained text generation remains a challenging task, particularly when
dealing with hard constraints. Traditional NLP approaches prioritize generating
meaningful and coherent output. Also, the current state-of-the-art methods
often lack the expressiveness and constraint satisfaction capabilities to
handle such tasks effectively. Recently, an approach for generating constrained
sentences in CP has been proposed in (Bonlarron et al., 2023). This ad hoc model
for solving the sentence generation problem under MNREAD rules nevertheless
proved computationally and structurally unsuitable for other, more constrained
problems. In this paper, a novel, more generic approach is introduced to tackle
many of these previously intractable problems, illustrated here with the highly
constrained sentence generation problem following RADNER rules.
More precisely, this paper presents the CPTextGen Framework. This framework
considers a constrained text generation problem as a discrete combinatorial
optimization problem. It is solved by a constraint programming method that
combines linguistic properties (e.g., n-grams or language level) with other
more classical constraints (e.g., the number of characters, syllables).
Eventually, a curation phase allows for selecting the best-generated sentences
according to perplexity using an LLM.
The effectiveness of this approach is demonstrated by tackling a new, more
tediously constrained text generation problem: the iconic RADNER sentences
problem. This problem aims to generate sentences respecting a set of quite
strict rules defined by their use in vision and clinical research. Thanks to
our CP-based approach, many new strongly constrained sentences have been
successfully generated. This highlights our approach's potential to handle
unreasonably constrained text generation scenarios.
comment: Disambiguation and additional references
♻ ☆ Baichuan-Omni Technical Report
Yadong Li, Haoze Sun, Mingan Lin, Tianpeng Li, Guosheng Dong, Tao Zhang, Bowen Ding, Wei Song, Zhenglin Cheng, Yuqi Huo, Song Chen, Xu Li, Da Pan, Shusen Zhang, Xin Wu, Zheng Liang, Jun Liu, Tao Zhang, Keer Lu, Yaqi Zhao, Yanjun Shen, Fan Yang, Kaicheng Yu, Tao Lin, Jianhua Xu, Zenan Zhou, Weipeng Chen
The salient multimodal capabilities and interactive experience of GPT-4o
highlight its critical role in practical applications, yet it lacks a
high-performing open-source counterpart. In this paper, we introduce
Baichuan-omni, the first open-source 7B Multimodal Large Language Model (MLLM)
adept at concurrently processing and analyzing modalities of image, video,
audio, and text, while delivering an advanced multimodal interactive experience
and strong performance. We propose an effective multimodal training schema
starting with a 7B model and proceeding through two stages of multimodal
alignment and multitask fine-tuning across audio, image, video, and text modalities.
This approach equips the language model with the ability to handle visual and
audio data effectively. Demonstrating strong performance across various
omni-modal and multimodal benchmarks, we aim for this contribution to serve as
a competitive baseline for the open-source community in advancing multimodal
understanding and real-time interaction.
♻ ☆ Preemptive Detection and Correction of Misaligned Actions in LLM Agents
Deploying LLM-based agents in real-life applications often faces a critical
challenge: the misalignment between agents' behavior and user intent. Such
misalignment may lead agents to unintentionally execute critical actions that
carry negative outcomes (e.g., accidentally triggering a "buy-now" in web
shopping), resulting in undesirable or even irreversible consequences. Although
addressing these issues is crucial, the preemptive detection and correction of
misaligned actions remains relatively underexplored. To fill this gap, we
introduce InferAct, a novel approach that leverages the belief reasoning
ability of LLMs, grounded in Theory-of-Mind, to detect misaligned actions
before execution. Once the misalignment is detected, InferAct alerts users for
timely correction, preventing adverse outcomes and enhancing the reliability of
LLM agents' decision-making processes. Experiments on three widely used tasks
demonstrate that InferAct achieves up to 20% improvements on Marco-F1 against
baselines in misaligned action detection. An in-depth evaluation of
misalignment correction further highlights InferAct's effectiveness in
improving agent alignment.
♻ ☆ MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training ICLR 2024
Yizhi Li, Ruibin Yuan, Ge Zhang, Yinghao Ma, Xingran Chen, Hanzhi Yin, Chenghao Xiao, Chenghua Lin, Anton Ragni, Emmanouil Benetos, Norbert Gyenge, Roger Dannenberg, Ruibo Liu, Wenhu Chen, Gus Xia, Yemin Shi, Wenhao Huang, Zili Wang, Yike Guo, Jie Fu
Self-supervised learning (SSL) has recently emerged as a promising paradigm
for training generalisable models on large-scale data in the fields of vision,
text, and speech. Although SSL has been proven effective in speech and audio,
its application to music audio has yet to be thoroughly explored. This is
partially due to the distinctive challenges associated with modelling musical
knowledge, particularly tonal and pitched characteristics of music. To address
this research gap, we propose an acoustic Music undERstanding model with
large-scale self-supervised Training (MERT), which incorporates teacher models
to provide pseudo labels in the masked language modelling (MLM) style acoustic
pre-training. In our exploration, we identified an effective combination of
teacher models, which outperforms conventional speech and audio approaches in
terms of performance. This combination includes an acoustic teacher based on
Residual Vector Quantisation - Variational AutoEncoder (RVQ-VAE) and a musical
teacher based on the Constant-Q Transform (CQT). Furthermore, we explore a wide
range of settings to overcome the instability in acoustic language model
pre-training, which allows our designed paradigm to scale from 95M to 330M
parameters. Experimental results indicate that our model can generalise and
perform well on 14 music understanding tasks and attain state-of-the-art (SOTA)
overall scores.
comment: accepted by ICLR 2024
♻ ☆ Blessing or curse? A survey on the Impact of Generative AI on Fake News
Fake News significantly influences our society. It impacts consumers, voters,
and many other societal groups. While Fake News has existed for centuries,
Generative AI takes it to a new level. It is now possible to automate the
creation of masses of high-quality, individually targeted Fake News. On the
other hand, Generative AI can also help detect Fake News. Both fields are
young but developing fast.
This survey provides a comprehensive examination of the research and
practical use of Generative AI for Fake News detection and creation in 2024.
Following the Structured Literature Survey approach, the paper synthesizes
current results in the following topic clusters: 1) enabling technologies, 2)
creation of Fake News, 3) a case study of social media as the most relevant
distribution channel, 4) detection of Fake News, and 5) deepfakes as an upcoming technology.
The article also identifies current challenges and open issues.
comment: 16 pages, 2 figures. Submitted to ACM Transactions on Intelligent
Systems and Technology (ACM TIST). Added references
♻ ☆ Agent-OM: Leveraging LLM Agents for Ontology Matching
Ontology matching (OM) enables semantic interoperability between different
ontologies and resolves their conceptual heterogeneity by aligning related
entities. OM systems currently have two prevailing design paradigms:
conventional knowledge-based expert systems and newer machine learning-based
predictive systems. While large language models (LLMs) and LLM agents have
revolutionised data engineering and have been applied creatively in many
domains, their potential for OM remains underexplored. This study introduces a
novel agent-powered LLM-based design paradigm for OM systems. With
consideration of several specific challenges in leveraging LLM agents for OM,
we propose a generic framework, namely Agent-OM (Agent for Ontology Matching),
consisting of two Siamese agents for retrieval and matching, with a set of OM
tools. Our framework is implemented in a proof-of-concept system. Evaluations
of three Ontology Alignment Evaluation Initiative (OAEI) tracks over
state-of-the-art OM systems show that our system can achieve results very close
to the long-standing best performance on simple OM tasks and can significantly
improve the performance on complex and few-shot OM tasks.
comment: 19 pages, 12 figures, 3 tables
♻ ☆ Mamba for Streaming ASR Combined with Unimodal Aggregation ICASSP 2025
This paper works on streaming automatic speech recognition (ASR). Mamba, a
recently proposed state space model, has demonstrated the ability to match or
surpass Transformers in various tasks while benefiting from a linear complexity
advantage. We explore the efficiency of Mamba encoder for streaming ASR and
propose an associated lookahead mechanism for leveraging controllable future
information. Additionally, a streaming-style unimodal aggregation (UMA) method
is implemented, which automatically detects token activity, triggers token
output in a streaming fashion, and meanwhile aggregates feature frames for better
learning token representation. Based on UMA, an early termination (ET) method
is proposed to further reduce recognition latency. Experiments conducted on two
Mandarin Chinese datasets demonstrate that the proposed model achieves
competitive ASR performance in terms of both recognition accuracy and latency.
comment: Accepted by ICASSP 2025
♻ ☆ LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating
Chao Deng, Jiale Yuan, Pi Bu, Peijie Wang, Zhong-Zhi Li, Jian Xu, Xiao-Hui Li, Yuan Gao, Jun Song, Bo Zheng, Cheng-Lin Liu
Large vision language models (LVLMs) have remarkably improved document
understanding capabilities, enabling the handling of complex document elements,
longer contexts, and a wider range of tasks. However, existing document
understanding benchmarks have been limited to handling only a small number of
pages and fail to provide a comprehensive analysis of the locating of layout elements.
In this paper, we first define three primary task categories: Long Document
Understanding, numerical Reasoning, and cross-element Locating, and then
propose a comprehensive benchmark, LongDocURL, integrating the above three primary
tasks and comprising 20 sub-tasks categorized based on different primary tasks
and answer evidence. Furthermore, we develop a semi-automated construction
pipeline and collect 2,325 high-quality question-answering pairs, covering more
than 33,000 pages of documents, significantly outperforming existing
benchmarks. Subsequently, we conduct comprehensive evaluation experiments on
both open-source and closed-source models across 26 different configurations,
revealing critical performance gaps in this field.
♻ ☆ Building a Taiwanese Mandarin Spoken Language Model: A First Attempt
Chih-Kai Yang, Yu-Kuan Fu, Chen-An Li, Yi-Cheng Lin, Yu-Xiang Lin, Wei-Chih Chen, Ho Lam Chung, Chun-Yi Kuan, Wei-Ping Huang, Ke-Han Lu, Tzu-Quan Lin, Hsiu-Hsuan Wang, En-Pei Hu, Chan-Jan Hsu, Liang-Hsuan Tseng, I-Hsiang Chiu, Ulin Sanga, Xuanjun Chen, Po-chun Hsu, Shu-wen Yang, Hung-yi Lee
This technical report presents our initial attempt to build a spoken large
language model (LLM) for Taiwanese Mandarin, specifically tailored to enable
real-time, speech-to-speech interaction in multi-turn conversations. Our
end-to-end model incorporates a decoder-only transformer architecture and aims
to achieve seamless interaction while preserving the conversational flow,
including full-duplex capabilities allowing simultaneous speaking and
listening. The paper also details the training process, including data
preparation with synthesized dialogues and adjustments for real-time
interaction. We also developed a platform to evaluate conversational fluency
and response coherence in multi-turn dialogues. We hope the release of the
report can contribute to the future development of spoken LLMs in Taiwanese
Mandarin.
comment: Work in progress
♻ ☆ Do LLMs Really Think Step-by-step In Implicit Reasoning?
It has been well-known that Chain-of-Thought can remarkably enhance LLMs'
performance on complex tasks. However, because it also introduces slower
inference speeds and higher computational costs, many studies have attempted
to use implicit CoT, which does not need LLMs to explicitly generate the
intermediate steps. However, the invisible reasoning process leaves us
wondering: can implicit CoT really be equal to explicit CoT? Therefore, in this
study, we address this question through experiments. We probe the information
of intermediate steps from the model's hidden states when it is either trained
or prompted to perform implicit CoT. The results surprisingly indicate that
when prompted, LLMs hardly think about intermediate steps, suggesting they may
just rely on experience rather than strict step-by-step reasoning. But when
trained, they indeed calculate intermediate steps. Moreover, in both
situations, we find the effect of using implicit CoT is susceptible to the
format of the problem, reaffirming the current deficiency of implicit CoT.
♻ ☆ A Text is Worth Several Tokens: Text Embedding from LLMs Secretly Aligns Well with The Key Tokens
Text embeddings from large language models (LLMs) have achieved excellent
results in tasks such as information retrieval, semantic textual similarity,
etc. In this work, we show an interesting finding: when feeding a text into the
LLM-based embedder, the obtained text embedding can be aligned with
the key tokens in the input text. We first fully analyze this phenomenon on
eight LLM-based embedders and show that this phenomenon is universal and is not
affected by model architecture, training strategy, and embedding method. With a
deeper analysis, we find that the main change in embedding space between these
embedders and their LLM backbones is in the first principal component. By
adjusting the first principal component, we can align text embedding with the
key tokens. Finally, we give several examples to demonstrate the vast
application potential of this finding: (1) we propose a simple and practical
sparse retrieval method based on the aligned tokens, which can achieve 80% of
the dense retrieval effect of the same model while reducing the computation
significantly; (2) we show that our findings provide a novel perspective to
help understand novel technologies (e.g., instruction-following embedding) and
fuzzy concepts (e.g., semantic relatedness vs. similarity) in this field.
comment: Work in Progress
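The alignment the authors describe can be probed by comparing a text's embedding against the model's output (decoding) token embeddings. A rough sketch with a Hugging Face causal LM, where the gpt2 backbone, mean pooling, and the top-k cutoff are all illustrative assumptions rather than the paper's setup.
```python
# Rough sketch: score every vocabulary token against a pooled text embedding
# from an LLM backbone, surfacing candidate "key tokens" it aligns with.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for an LLM-based embedder backbone
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

text = "Text embeddings secretly align with key tokens."
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs, output_hidden_states=True).hidden_states[-1]
embedding = hidden.mean(dim=1).squeeze(0)           # mean-pooled text embedding

decoder_emb = model.get_output_embeddings().weight  # [vocab, d]
scores = decoder_emb @ embedding                    # similarity to each token
top = torch.topk(scores, k=10).indices
print(tok.convert_ids_to_tokens(top.tolist()))
```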
♻ ☆ Multi-Agent Collaboration in Incident Response with Large Language Models
Incident response (IR) is a critical aspect of cybersecurity, requiring rapid
decision-making and coordinated efforts to address cyberattacks effectively.
Leveraging large language models (LLMs) as intelligent agents offers a novel
approach to enhancing collaboration and efficiency in IR scenarios. This paper
explores the application of LLM-based multi-agent collaboration using the
Backdoors & Breaches framework, a tabletop game designed for cybersecurity
training. We simulate real-world IR dynamics through various team structures,
including centralized, decentralized, and hybrid configurations. By analyzing
agent interactions and performance across these setups, we provide insights
into optimizing multi-agent collaboration for incident response. Our findings
highlight the potential of LLMs to enhance decision-making, improve
adaptability, and streamline IR processes, paving the way for more effective
and coordinated responses to cyber threats.
♻ ☆ Exploring Parameter-Efficient Fine-Tuning Techniques for Code Generation with Large Language Models
Large language models (LLMs) demonstrate impressive capabilities to generate
accurate code snippets given natural language intents in a zero-shot manner,
i.e., without the need for specific fine-tuning. While prior studies have
highlighted the advantages of fine-tuning LLMs, this process incurs high
computational costs, making it impractical in resource-scarce environments,
particularly for models with billions of parameters. To address these
challenges, previous research explored in-context learning (ICL) and
retrieval-augmented generation (RAG) as strategies to guide the LLM generative
process with task-specific prompt examples. However, ICL and RAG introduce
inconveniences, such as the need for designing contextually relevant prompts
and the absence of learning task-specific parameters, thereby limiting
downstream task performance. In this context, we foresee parameter-efficient
fine-tuning (PEFT) as a promising approach to efficiently specialize LLMs to
task-specific data while maintaining reasonable resource consumption. In this
paper, we deliver a comprehensive study of PEFT techniques for LLMs in the
context of automated code generation. Our comprehensive investigation of PEFT
techniques for LLMs reveals their superiority and potential over ICL and RAG
across a diverse set of LLMs and three representative Python code generation
datasets: Conala, CodeAlpacaPy, and APPS. Furthermore, our study highlights the
potential for tuning larger LLMs and significant reductions in memory usage by
combining PEFT with quantization. Therefore, this study opens opportunities for
broader applications of PEFT in software engineering scenarios. Our code is
available at https://github.com/martin-wey/peft-llm-code/.
♻ ☆ CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences
Evaluating the alignment of large language models (LLMs) with user-defined
coding preferences is a challenging endeavour that requires a deep assessment
of LLMs' outputs. Existing methods and benchmarks rely primarily on automated
metrics and static analysis tools, which often fail to capture the nuances of
user instructions and LLM outputs. To address this gap, we propose using the
LLM-as-a-Judge methodology to evaluate the alignment of LLMs with coding
preferences. Based on this approach, we present CodeUltraFeedback, a
comprehensive dataset designed to facilitate the evaluation and improvement of
LLM alignment. CodeUltraFeedback consists of 10,000 coding instructions, each
annotated with four responses generated from a diverse pool of 14 LLMs. These
responses are ranked based on five distinct coding preferences using GPT-3.5 as
a judge, providing both numerical scores and detailed textual feedback. Our
analysis of CodeUltraFeedback reveals that responses from GPT-3.5 and GPT-4 are
generally preferred over those from open-weight LLMs, highlighting significant
differences in alignment between closed and open-weight models. In turn, we
explore the usage of CodeUltraFeedback as feedback data to fine-tune and align
CodeLlama-7B-Instruct using supervised fine-tuning (SFT) and reinforcement
learning from AI feedback (RLAIF) with direct preference optimization (DPO).
The resulting aligned CodeLlama-7B-Instruct model outperforms larger LLMs in
terms of alignment with coding preferences and shows improved functional
correctness on the HumanEval+ benchmark compared to the original instruct
model. Therefore, our contributions bridge the gap in preference tuning of LLMs
for code and set the stage for further advancements in model alignment and
RLAIF in automated software engineering.
♻ ☆ Model Fusion through Bayesian Optimization in Language Model Fine-Tuning
Fine-tuning pre-trained models for downstream tasks is a widely adopted
technique known for its adaptability and reliability across various domains.
Despite its conceptual simplicity, fine-tuning entails several troublesome
engineering choices, such as selecting hyperparameters and determining
checkpoints from an optimization trajectory. To tackle the difficulty of
choosing the best model, one effective solution is model fusion, which combines
multiple models in a parameter space. However, we observe a large discrepancy
between loss and metric landscapes during the fine-tuning of pre-trained
language models. Building on this observation, we introduce a novel model
fusion technique that optimizes both the desired metric and loss through
multi-objective Bayesian optimization. In addition, to effectively select
hyperparameters, we establish a two-stage procedure by integrating Bayesian
optimization processes into our framework. Experiments across various
downstream tasks show considerable performance improvements using our Bayesian
optimization-guided method.
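A simplified sketch of the fusion idea: search for a mixing weight over fine-tuned checkpoints that trades off validation loss against the task metric. Random search stands in for the paper's multi-objective Bayesian optimization, and `evaluate` is a placeholder that returns (loss, metric) on a validation set.
```python
# Sketch: fuse two fine-tuned checkpoints by weight interpolation and pick the
# mixing coefficient with the best scalarized loss/metric trade-off. Random
# search is a stand-in for multi-objective Bayesian optimization.
import random

def fuse(sd_a, sd_b, lam):
    return {k: lam * sd_a[k] + (1.0 - lam) * sd_b[k] for k in sd_a}

def search_fusion(sd_a, sd_b, evaluate, trials=20, loss_weight=0.5):
    best_lam, best_score = None, float("inf")
    for _ in range(trials):
        lam = random.random()
        loss, metric = evaluate(fuse(sd_a, sd_b, lam))
        score = loss_weight * loss - (1.0 - loss_weight) * metric
        if score < best_score:
            best_lam, best_score = lam, score
    return best_lam
```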
♻ ☆ Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code
Taishi Nakamura, Mayank Mishra, Simone Tedeschi, Yekun Chai, Jason T Stillerman, Felix Friedrich, Prateek Yadav, Tanmay Laud, Vu Minh Chien, Terry Yue Zhuo, Diganta Misra, Ben Bogin, Xuan-Son Vu, Marzena Karpinska, Arnav Varma Dantuluri, Wojciech Kusa, Tommaso Furlanello, Rio Yokota, Niklas Muennighoff, Suhas Pai, Tosin Adewumi, Veronika Laippala, Xiaozhe Yao, Adalberto Junior, Alpay Ariyak, Aleksandr Drozd, Jordan Clive, Kshitij Gupta, Liangyu Chen, Qi Sun, Ken Tsui, Noah Persaud, Nour Fahmy, Tianlong Chen, Mohit Bansal, Nicolo Monti, Tai Dang, Ziyang Luo, Tien-Tung Bui, Roberto Navigli, Virendra Mehta, Matthew Blumberg, Victor May, Huu Nguyen, Sampo Pyysalo
Pretrained language models are an integral part of AI applications, but their
high computational cost for training limits accessibility. Initiatives such as
Bloom and StarCoder aim to democratize access to pretrained models for
collaborative community development. Despite these efforts, such models
encounter challenges such as limited multilingual capabilities, risks of
catastrophic forgetting during continual pretraining, and the high costs of
training models from scratch, alongside the need to align with AI safety
standards and regulatory frameworks.
This paper presents Aurora-M, a 15B parameter multilingual open-source model
trained on English, Finnish, Hindi, Japanese, Vietnamese, and code. Continually
pretrained from StarCoderPlus on 435B additional tokens, Aurora-M surpasses 2T
tokens in total training token count. It is the first open-source multilingual
model fine-tuned on human-reviewed safety instructions, thus aligning its
development not only with conventional red-teaming considerations, but also
with the specific concerns articulated in the Biden-Harris Executive Order on
the Safe, Secure, and Trustworthy Development and Use of Artificial
Intelligence.
We evaluate Aurora-M across a wide range of tasks and languages, showcasing
its robustness against catastrophic forgetting and its superior performance in
multilingual settings, particularly in safety evaluations. We open-source
Aurora-M and its variants to encourage responsible open-source development of
large language models at https://huggingface.co/aurora-m.
comment: Preprint
♻ ☆ Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence
Jinghan He, Kuan Zhu, Haiyun Guo, Junfeng Fang, Zhenglin Hua, Yuheng Jia, Ming Tang, Tat-Seng Chua, Jinqiao Wang
Large vision-language models (LVLMs) have made substantial progress in
integrating large language models (LLMs) with visual inputs, enabling advanced
multimodal reasoning. Despite their success, a persistent challenge is
hallucination, where generated text fails to accurately reflect visual
content, undermining both accuracy and reliability. Existing methods focus on
alignment training or decoding refinements but primarily address symptoms at
the generation stage without probing the underlying causes. In this work, we
investigate the internal mechanisms driving hallucination in LVLMs, with an
emphasis on the multi-head attention module. Specifically, we introduce
Vision-aware Head Divergence (VHD), a metric that quantifies the sensitivity of
attention head outputs to visual context. Based on this, our findings reveal
the presence of vision-aware attention heads that are more attuned to visual
information; however, the model's overreliance on its prior language patterns
is closely related to hallucinations. Building on these insights, we propose
Vision-aware Head Reinforcement (VHR), a training-free approach to mitigate
hallucination by enhancing the role of vision-aware attention heads. Extensive
experiments demonstrate that our method achieves superior performance compared
to state-of-the-art approaches in mitigating hallucinations, while maintaining
high efficiency with negligible additional time overhead.
♻ ☆ Rules still work for Open Information Extraction
Open information extraction (OIE) aims to extract surface relations and their
corresponding arguments from natural language text, irrespective of domain.
This paper presents an innovative OIE model, APRCOIE, tailored for Chinese
text. Diverging from previous models, our model generates extraction patterns
autonomously. The model defines a new pattern form for Chinese OIE and proposes
an automated pattern generation methodology. In that way, the model can handle
a wide array of complex and diverse Chinese grammatical phenomena. We design a
preliminary filter based on tensor computing to conduct the extraction
procedure efficiently. To train the model, we manually annotated a large-scale
Chinese OIE dataset. In the comparative evaluation, we demonstrate that APRCOIE
outperforms state-of-the-art Chinese OIE models and significantly expands the
boundaries of achievable OIE performance. The code of APRCOIE and the annotated
dataset are released on GitHub (https://github.com/jialin666/APRCOIE_v1).