⛷️ Paper Under Submission

ArXiv

Improve Large Language Model Systems with User Logs

Changyue Wang, Weihang Su, Qingyao Ai, Yiqun Liu

Code

  • UNO (User log-driveN Optimization) is a framework for improving LLM systems from raw user logs. It distills unstructured logs into semi-structured rules and preference pairs, organizes heterogeneous feedback through query-and-rule clustering, and uses cognitive-gap assessment to build primary and reflective experience modules. UNO provides a systematic study of user-log-driven optimization and identifies the Signal-or-Noise Dilemma in feedback-based continual improvement.

📝 Publications

ICML 2026 Spotlight

MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems

Qingyao Ai, Yichen Tang, Changyue Wang, Jianming Long, Weihang Su, Yiqun Liu

Code | HF Dataset

  • MemoryBench is a comprehensive benchmark for evaluating memory and continual learning in LLM systems. Unlike prior memory benchmarks, which mainly reduce memory evaluation to static long-context reading comprehension, MemoryBench simulates how LLM systems learn from accumulated user feedback during service time, establishing a new evaluation setting for LLM memory systems.
AAAI 2026 Oral

Joint Evaluation of Answer and Reasoning Consistency for Hallucination Detection in Large Reasoning Models

Changyue Wang, Weihang Su, Qingyao Ai, Yiqun Liu

Code | HF Model

  • RACE (Reasoning and Answer Consistency Evaluation) is a framework for detecting hallucinations in Large Reasoning Models (LRMs) by jointly analyzing reasoning traces and final answers. It detects inconsistencies and hallucinations through multi-signal analysis, achieving robust and generalizable performance across models and datasets. RACE is the first work to show that prior black-box hallucination detection methods are fundamentally flawed when applied to LRMs, and it pioneers black-box hallucination detection for LRMs.
EMNLP 2025 Main

Knowledge Editing through Chain-of-Thought

Changyue Wang, Weihang Su, Qingyao Ai, Yichen Tang, Yiqun Liu

Code

  • EditCoT is a novel knowledge editing framework that updates LLMs through iterative chain-of-thought refinement, enabling efficient integration of new knowledge without retraining. It achieves state-of-the-art performance across diverse tasks and languages, offering superior generalization, stability, and effectiveness.
ACL 2025 Findings

Decoupling Reasoning and Knowledge Injection for In-Context Knowledge Editing

Changyue Wang, Weihang Su, Qingyao Ai, Yujia Zhou, Yiqun Liu

Code

  • DecKER is a novel in-context editing framework that decouples reasoning from knowledge injection, mitigating conflicts between updated and original knowledge. It achieves significant improvements in multi-hop reasoning by preserving reasoning integrity while efficiently integrating new knowledge.
SIGIR-AP 2024

LeKUBE: A Knowledge Update BEnchmark for Legal Domain

Changyue Wang, Weihang Su, Yiran Hu, Qingyao Ai, Yueyue Wu, Cheng Luo, Yiqun Liu, Min Zhang, Shaoping Ma

Code

  • LeKUBE is a comprehensive benchmark designed to evaluate knowledge update methods for legal LLMs. It highlights the unique challenges of updating legal knowledge, such as nuanced statutory changes and complex reasoning, and reveals a significant gap between current techniques and real-world legal needs.
ACL 2024 Findings

Unsupervised Real-Time Hallucination Detection Based on the Internal States of Large Language Models

Weihang Su*, Changyue Wang*, Qingyao Ai, Yiran Hu, Zhijing Wu, Yujia Zhou, Yiqun Liu

Code

  • MIND is an unsupervised framework that detects hallucinations in LLMs in real time by leveraging their internal states during inference. The accompanying benchmark, HELM, evaluates hallucination detection across diverse models and scenarios.

Internships

  • Research intern, TikTok Data Search, ByteDance, China. (2025.09 - Present)