⛷️ Paper Under Submission

Improve Large Language Model Systems with User Logs
Changyue Wang, Weihang Su, Qingyao Ai, Yiqun Liu
- UNO (User log-driveN Optimization) is a framework for improving LLM systems from raw user logs. It distills unstructured logs into semi-structured rules and preference pairs, organizes heterogeneous feedback through query-and-rule clustering, and uses cognitive-gap assessment to build primary and reflective experience modules. UNO provides a systematic study of user-log-driven optimization and identifies the Signal-or-Noise Dilemma in feedback-based continual improvement.
📝 Publications

MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems
Qingyao Ai, Yichen Tang, Changyue Wang, Jianming Long, Weihang Su, Yiqun Liu
- MemoryBench is a comprehensive benchmark for evaluating memory and continual learning in LLM systems. Unlike prior memory benchmarks, which mainly reduce memory evaluation to static long-context reading comprehension, MemoryBench simulates how LLM systems learn from accumulated user feedback during service time, establishing this as a new evaluation setting for LLM memory systems.

Changyue Wang, Weihang Su, Qingyao Ai, Yiqun Liu
- RACE (Reasoning and Answer Consistency Evaluation) is a framework for detecting hallucinations in Large Reasoning Models (LRMs) by jointly analyzing reasoning traces and final answers. It detects inconsistencies and hallucinations through multi-signal analysis, achieving robust and generalizable performance across models and datasets. RACE is the first work to show that prior black-box hallucination detection methods are fundamentally flawed when applied to LRMs, and it pioneers black-box hallucination detection for LRMs.

Knowledge Editing through Chain-of-Thought
Changyue Wang, Weihang Su, Qingyao Ai, Yichen Tang, Yiqun Liu
- EditCoT is a novel knowledge editing framework that updates LLMs through iterative chain-of-thought refinement, enabling efficient integration of new knowledge without retraining. It achieves state-of-the-art performance across diverse tasks and languages, offering superior generalization, stability, and effectiveness.

Decoupling Reasoning and Knowledge Injection for In-Context Knowledge Editing
Changyue Wang, Weihang Su, Qingyao Ai, Yujia Zhou, Yiqun Liu
- DecKER is a novel in-context editing framework that decouples reasoning from knowledge injection, mitigating conflicts between updated and original knowledge. It achieves significant improvements in multi-hop reasoning by preserving reasoning integrity while efficiently integrating new knowledge.

LeKUBE: A Knowledge Update BEnchmark for Legal Domain
Changyue Wang, Weihang Su, Yiran Hu, Qingyao Ai, Yueyue Wu, Cheng Luo, Yiqun Liu, Min Zhang, Shaoping Ma
- LeKUBE is a comprehensive benchmark designed to evaluate knowledge update methods for legal LLMs. It highlights the unique challenges of updating legal knowledge—such as nuanced statutory changes and complex reasoning—revealing a significant gap between current techniques and real-world legal needs.

Unsupervised Real-Time Hallucination Detection Based on the Internal States of Large Language Models
Weihang Su*, Changyue Wang*, Qingyao Ai, Yiran Hu, Zhijing Wu, Yujia Zhou, Yiqun Liu
- MIND is an unsupervised framework that detects hallucinations in LLMs in real time by leveraging their internal states during inference. The accompanying HELM benchmark evaluates hallucination detection across diverse models and scenarios.
- Parametric Retrieval Augmented Generation, Weihang Su, Yichen Tang, Qingyao Ai, Junxi Yan, Changyue Wang, Hongning Wang, Ziyi Ye, Yujia Zhou, Yiqun Liu. SIGIR 2025
- JuDGE: Benchmarking Judgment Document Generation for Chinese Legal System, Weihang Su, Baoqing Yue, Qingyao Ai, Yiran Hu, Jiaqi Li, Changyue Wang, Kaiyuan Zhang, Yueyue Wu, Yiqun Liu. SIGIR 2025
- Pre-training for Legal Case Retrieval Based on Inter-Case Distinctions, Weihang Su, Qingyao Ai, Yueyue Wu, Anzhe Xie, Changyue Wang, Yixiao Ma, Haitao Li, Zhijing Wu, Yiqun Liu, Min Zhang. ACM TOIS
- Mitigating Entity-Level Hallucination in Large Language Models, Weihang Su, Yichen Tang, Qingyao Ai, Changyue Wang, Zhijing Wu, Yiqun Liu. SIGIR-AP 2024
Internships
- Research intern, TikTok Data Search, ByteDance, China. (2025.09 - Present)