⛷️ Paper Under Submission

ArXiv

Changyue Wang, Weihang Su, Qingyao Ai, Yiqun Liu

UNO (User log-driveN Optimization) is a framework for improving LLM systems from raw user logs. It distills unstructured logs into semi-structured rules and preference pairs, organizes heterogeneous feedback through query-and-rule clustering, and uses cognitive-gap assessment to build primary and reflective experience modules. UNO provides a systematic study of user-log-driven optimization and identifies the Signal-or-Noise Dilemma in feedback-based continual improvement.

📝 Publications

ICML 2026 Spotlight

Qingyao Ai, Yichen Tang, Changyue Wang, Jianming Long, Weihang Su, Yiqun Liu

Code | HF Dataset

MemoryBench is a comprehensive benchmark for evaluating memory and continual learning in LLM systems. Unlike prior memory benchmarks, which mainly reduce memory evaluation to static long-context reading comprehension, MemoryBench simulates how LLM systems learn from accumulated user feedback during service time, establishing a new evaluation setting for LLM memory systems.

AAAI 2026 Oral

Changyue Wang, Weihang Su, Qingyao Ai, Yiqun Liu

Code | HF Model

RACE (Reasoning and Answer Consistency Evaluation) is a framework for detecting hallucinations in Large Reasoning Models (LRMs) by jointly analyzing reasoning traces and final answers. It detects inconsistencies and hallucinations through multi-signal analysis, achieving robust and generalizable performance across models and datasets. RACE is the first to reveal that prior black-box hallucination detection methods are fundamentally flawed when applied to LRMs, and it pioneers the direction of black-box hallucination detection for LRMs.

EMNLP 2025 Main

Changyue Wang, Weihang Su, Qingyao Ai, Yichen Tang, Yiqun Liu

EditCoT is a novel knowledge editing framework that updates LLMs through iterative chain-of-thought refinement, enabling efficient integration of new knowledge without retraining. It achieves state-of-the-art performance across diverse tasks and languages, offering superior generalization, stability, and effectiveness.

ACL 2025 Findings

Changyue Wang, Weihang Su, Qingyao Ai, Yujia Zhou, Yiqun Liu

DecKER is a novel in-context editing framework that decouples reasoning from knowledge injection, mitigating conflicts between updated and original knowledge. It achieves significant improvements in multi-hop reasoning by preserving reasoning integrity while efficiently integrating new knowledge.

SIGIR-AP 2024

Changyue Wang, Weihang Su, Yiran Hu, Qingyao Ai, Yueyue Wu, Cheng Luo, Yiqun Liu, Min Zhang, Shaoping Ma

LeKUBE is a comprehensive benchmark designed to evaluate knowledge update methods for legal LLMs. It highlights the unique challenges of updating legal knowledge, such as nuanced statutory changes and complex reasoning, and reveals a significant gap between current techniques and real-world legal needs.

ACL 2024 Findings

Weihang Su*, Changyue Wang*, Qingyao Ai, Yiran Hu, Zhijing Wu, Yujia Zhou, Yiqun Liu

MIND is an unsupervised framework that detects hallucinations in LLMs in real time by leveraging their internal states during inference. The accompanying HELM benchmark evaluates hallucination detection across diverse models and scenarios.

Parametric Retrieval Augmented Generation, Weihang Su, Yichen Tang, Qingyao Ai, Junxi Yan, Changyue Wang, Hongning Wang, Ziyi Ye, Yujia Zhou, Yiqun Liu. SIGIR 2025
JuDGE: Benchmarking Judgment Document Generation for Chinese Legal System, Weihang Su, Baoqing Yue, Qingyao Ai, Yiran Hu, Jiaqi Li, Changyue Wang, Kaiyuan Zhang, Yueyue Wu, Yiqun Liu. SIGIR 2025
Pre-training for Legal Case Retrieval Based on Inter-Case Distinctions, Weihang Su, Qingyao Ai, Yueyue Wu, Anzhe Xie, Changyue Wang, Yixiao Ma, Haitao Li, Zhijing Wu, Yiqun Liu, Min Zhang. ACM TOIS
Mitigating Entity-Level Hallucination in Large Language Models, Weihang Su, Yichen Tang, Qingyao Ai, Changyue Wang, Zhijing Wu, Yiqun Liu. SIGIR-AP 2024