Recent advances in AI have heightened attention on the foundations of evaluation. As models become more performant, traditional metrics and benchmarks increasingly fail to capture meaningful differences in system behavior. Indeed, Voorhees et al. observe that modern retrieval models have saturated high-precision metrics, calling for ''new strategies and tools for building reliable test collections.'' I describe preference-based evaluation, a framework that reinterprets evaluation as an ordering over system behaviors rather than the computation of numeric scores. Although preferences are common in laboratory studies and online evaluation, automatic evaluation methods such as average precision or reciprocal rank have traditionally lacked preference-based counterparts. Drawing on foundational work in information retrieval evaluation and social-choice theory, I introduce a family of methods for conducting efficient, automatic, preference-based evaluation. Through a series of experiments across retrieval and recommendation tasks, preference-based versions of precision, recall, and average precision all demonstrate substantially higher sensitivity, addressing recent trends of metric saturation.
Multimodal Retrieval Augmented Generation (MRAG) systems have shown promise in enhancing the generation capabilities of multimodal large language models (MLLMs). However, existing MRAG frameworks primarily adhere to rigid, single-step retrieval strategies that fail to address real-world challenges of information acquisition and query reformulation. In this work, we introduce the task of Multimodal Retrieval Augmented Generation Planning (MRAG Planning) that aims at effective information seeking and integration while minimizing computational overhead. Specifically, we propose CogPlanner, an agentic plug-and-play framework inspired by human cognitive processes, which iteratively determines query reformulation and retrieval strategies to generate accurate and contextually relevant responses. CogPlanner supports parallel and sequential modeling paradigms. Furthermore, we introduce CogBench, a new benchmark designed to rigorously evaluate the MRAG Planning task and facilitate lightweight CogPlanner integration with resource-efficient MLLMs, such as Qwen2-VL-7B-Cog. Experimental results demonstrate that CogPlanner significantly outperforms existing MRAG baselines, offering improvements in both accuracy and efficiency with minimal additional computational costs.
Hallucinations present a significant challenge for large language models (LLMs). The utilization of parametric knowledge in generating factual content is constrained by the limited knowledge of LLMs, potentially resulting in internal hallucinations. While incorporating external information can help fill knowledge gaps, it also introduces the risk of irrelevant information, thereby increasing the likelihood of external hallucinations. To balance the use of parametric knowledge within LLMs and external information, in this study, we present Rowen, a novel framework that enhances LLMs with an adaptive retrieval augmentation process tailored to address hallucinated outputs. Rowen introduces a consistency-based hallucination detection module, which assesses the model's uncertainty regarding the input query by evaluating the semantic inconsistencies in various responses generated across different languages or models. When high uncertainties in the responses are detected, Rowen activates the retrieval of external information to rectify the model outputs. Through comprehensive empirical experiments, we demonstrate that Rowen surpasses the current state-of-the-art in both detecting and mitigating hallucinated content within the outputs of LLMs.
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating retrieved information. The standard retrieval process prioritizes relevance, focusing on topical alignment between queries and passages. In contrast, in RAG, the emphasis has shifted to utility, which considers the usefulness of passages for generating accurate answers. Despite empirical evidence showing the benefits of utility-based retrieval in RAG, the high computational cost of using LLMs for utility judgments limits the number of passages evaluated. This restriction is problematic for complex queries requiring extensive information. To address this, we propose a method to distill the utility judgment capabilities of LLMs into smaller, more efficient models. Our approach focuses on utility-based selection rather than ranking, enabling dynamic passage selection tailored to specific queries without the need for fixed thresholds. We train student models to learn pseudo-answer generation and utility judgments from teacher LLMs, using a sliding window method that dynamically selects useful passages. Our experiments demonstrate that utility-based selection provides a flexible and cost-effective solution for RAG, significantly reducing computational costs while improving answer quality. We present the distillation results using Qwen3-32B as the teacher model for both relevance ranking and utility-based selection, distilled into RankQwen1.7B and UtilityQwen1.7B. Our findings indicate that for complex questions, utility-based selection is more effective than relevance ranking in enhancing answer generation performance. We will release the relevance ranking and utility-based selection annotations for the MS MARCO dataset, supporting further research in this area.
Retrieval-augmented generation (RAG) systems traditionally employ sophisticated training strategies to enhance robustness against retrieval noise. In this work, we investigate a critical question: does the benefit of these complex robust training methods diminish as language models become more powerful? Through systematic evaluation across multiple model scales and question-answering datasets, our analysis reveals a consistent trend: the marginal robustness benefit of sophisticated training strategies decreases substantially as model capacity increases. While smaller models show significant performance improvements from complex document selection and adversarial objectives, more capable models achieve comparable or even superior performance with simpler training approaches. Further investigation demonstrates that stronger models naturally exhibit better confidence calibration, cross-dataset generalization capability, and more effective attention patterns, even under simple training regimes. These findings suggest that as foundation models evolve, the engineering effort invested in complex robust training may yield diminishing returns, indicating that simplified RAG pipelines could suffice for powerful models while maintaining competitive performance.
Retrieval-augmented generation (RAG) has been widely adopted to augment large language models (LLMs) with external knowledge for knowledge-intensive tasks. However, its effectiveness is often undermined by the presence of noisy (i.e., low-quality) retrieved passages. Enhancing LLMs' robustness to such noise is critical for improving the reliability of RAG systems. Recent advances have equipped LLMs with strong reasoning and self-reflection capabilities, allowing them to identify and correct errors in their reasoning process. Inspired by this ability, we propose Passage Injection, a simple yet effective method that explicitly incorporates retrieved passages into LLMs' reasoning process, aiming to enhance the model's ability to recognize and resist noisy passages. We validate Passage Injection under general RAG settings using BM25 as the retriever. Experiments on four reasoning-enhanced LLMs across four factual QA datasets demonstrate that Passage Injection significantly improves overall RAG performance. Further analysis on two noisy retrieval settings (random noise, where the model is provided irrelevant passages, and counterfactual noise, where it is given misleading passages) shows that Passage Injection consistently improves robustness. Controlled experiments confirm that Passage Injection can also effectively leverage helpful passages. These findings suggest that incorporating retrieved passages in LLMs' reasoning process is a promising direction for building more robust RAG systems.
Retrieval-Augmented Generation (RAG) has emerged as a crucial approach for enhancing the responses of large language models (LLMs) with external knowledge sources. Despite its impressive performance in complex question-answering tasks, RAG still struggles with hallucinations. Attributing RAG-generated content through in-line citations has demonstrated potential in reducing hallucinations and facilitating human verification. Existing citation generation methods primarily rely on either fine-tuning the generator or employing post-processing approaches for citation matching. However, the former approach demands substantial annotated data and computational resources, while the latter often encounters difficulties in managing multiple citations and frequently produces suboptimal results. In this paper, we introduce a novel framework, called VeriCite, designed to rigorously validate supporting evidence and enhance answer attribution. Specifically, VeriCite proceeds in three stages: 1) initial answer generation produces a response based on all available contexts and has its claims verified through an NLI model; 2) supporting evidence selection assesses the utility of each document and extracts useful supporting evidence; 3) final answer refinement integrates the initial response and the collected evidence to produce the final, refined answer. We conduct experiments across four open-source LLMs and four datasets, demonstrating that VeriCite can significantly improve citation quality while maintaining the correctness of the answers.
Large Language Models (LLMs) increasingly act as generative information intermediaries (GII) in retrieval-augmented applications, performing tasks like query (re)formulation, document selection, and relevance assessment. However, their responses are highly sensitive to prompt and instruction design, posing challenges for reproducibility, robustness, and evaluation in Information Retrieval (IR). We present geniie-lab, a research testbed for systematically investigating how instructions affect LLM responses across stages of a search pipeline. It supports controlled experiments on query formulation, ranking, and response refinement, with flexible integration of OpenSearch and LLM APIs. geniie-lab also enables reproducible experiments on prompt variation, task sensitivity, and agentic IR workflows. Released as an open resource, geniie-lab aims to advance empirical studies of model search behaviour and to support LLM-centric IR evaluation. The source code is available at https://github.com/geniie-lab/geniie-lab/
As LLMs become more widely adopted, detecting inconsistencies like bias and hallucination is increasingly important. Auditing LLMs for these inconsistencies is crucial but often challenging. An effective method for auditing an LLM involves using variations of the same question, referred to as probes, where consistent responses to these probes are expected. Deviations in the responses can indicate flaws in the model's knowledge representation or operational behavior. However, producing high-quality probes at scale remains challenging, primarily because it requires human experts to ensure the reliability of the probes. Prior work has relied on human experts to manually verify each individual probe, making the process expensive, resource-intensive, and prone to subjectivity. To address these limitations, we introduce LLMAuditor, a framework that uses a human-in-the-loop (HIL) validated prompt template to guide an LLM in generating probes. This approach eliminates the need for exhaustive human verification of every probe while maintaining high standards of quality and reliability. LLMAuditor operates in two phases: first, a helper LLM generates probes using a HIL-validated prompt template; second, these probes are used to audit the target LLM. This dual-LLM approach ensures verifiability, avoids circular reliance on a single model, and enhances the rigor and generalizability of the auditing process. Case studies on different LLMs show that LLMAuditor reliably identifies inconsistencies. The framework's novelty lies in its use of a HIL-validated prompt template for probe generation, which enhances both the transparency and effectiveness of LLM evaluation.
Graph Neural Networks (GNNs) have achieved promising results in numerous applications. To ensure accurate predictions, GNNs may exploit user data beyond the scope of consent, raising significant concerns regarding data usage and privacy. A critical task, therefore, is data usage privacy auditing, which aims to infer whether a data subject's information was improperly used during model training. While Membership Inference Attacks (MIA) are commonly used to audit data usage in fully supervised settings, most GNNs operate under a semi-supervised paradigm, where nodes fall into three distinct categories with varying degrees of involvement in training. This heterogeneity introduces a key challenge for applying traditional binary MIA methods. To address this, we study ternary data usage privacy auditing for semi-supervised GNNs in a black-box setting using a shadow training-based MIA framework. We identify that the main obstacle lies in the effects of message passing, which entangles node representations across categories. To tackle this, we propose MPMIA--a novel Message Passing-aware Membership Inference Attack--which leverages point-wise KL divergence to quantify and exploit the influence of message passing on node embeddings. Extensive experiments validate the effectiveness of MPMIA, showing clear improvements over baseline methods in accurately auditing data usage within semi-supervised GNNs.
Large language models (LLMs) are increasingly deployed in information systems, including being used as second-stage rerankers in information retrieval pipelines, yet their susceptibility to recency bias has received little attention. We investigate whether LLMs implicitly favour newer documents by prepending artificial publication dates to passages in the TREC Deep Learning passage retrieval collections in 2021 (DL21) and 2022 (DL22). Across seven models (GPT-3.5-turbo, GPT-4o, GPT-4, LLaMA-3 8B/70B, and Qwen-2.5 7B/72B), ''fresh'' passages are consistently promoted, shifting the Top-10's mean publication year forward by up to 4.78 years and moving individual items by as many as 95 ranks in our listwise reranking experiments. Although larger models attenuate the effect, none eliminate it. We also observe that the preference of LLMs between two passages with an identical relevance level can be reversed by up to 25% on average after date injection in our pairwise preference experiments. These findings provide quantitative evidence of a pervasive recency bias in LLMs and highlight the importance of effective bias-mitigation strategies.
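To make the manipulation concrete, the following minimal Python sketch illustrates the kind of date-injection probe described above; the date format, the example passages, and the downstream reranker call are hypothetical placeholders rather than the authors' code.

```python
# Illustrative sketch (not the study's code): probe recency bias by
# prepending synthetic publication dates to otherwise unchanged passages.
import random

def inject_date(passage: str, year: int) -> str:
    """Prefix a passage with an artificial publication date."""
    month, day = random.randint(1, 12), random.randint(1, 28)
    return f"Published: {year}-{month:02d}-{day:02d}. {passage}"

passages = ["Passage about topic A.", "Passage about topic B."]
years = [2005, 2024]  # same relevance level, different artificial ages
dated = [inject_date(p, y) for p, y in zip(passages, years)]
# The dated passages would then be fed to a listwise or pairwise LLM
# reranker, and the induced rank shifts compared against an undated baseline.
```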
Current Membership Inference Attacks (MIAs) against Graph Neural Networks (GNNs) primarily rely on the models' posterior output, rendering them unsuitable for application in real-world scenarios where GNNs usually exhibit label-only output. Inspired by the observation that the distance of a sample to the classification boundary, i.e., the smallest adversarial perturbation distance, can be a proxy of posterior output, we propose to find this distance for MIA on GNNs with label-only output. However, this approach faces two challenges: the intractable computation of non-Euclidean structural distance and the difficulty in integrating it with Euclidean attribute distance. To combat these, we propose Graph Distance Autoencoder (GDA) for calculating such adversarial perturbation distance in a latent space. GDA reconstructs the structure and attribute jointly to obtain a unified measurable distance. To further enhance the learning of latent space distance, GDA introduces a self-supervised task with contrastive learning to align the latent space with the input space. Comprehensive experiments on five representative datasets against two prevalent GNNs verify the superiority of our attack and the effectiveness of each key module.
With the increasing emphasis on improving group fairness in Information Retrieval (IR), shared tasks, such as the TREC Fair Ranking tracks and the NTCIR FairWeb tasks, have emerged to encourage the community to develop more group-fairness-aware IR systems. However, these shared tasks have adopted different group fairness evaluation measures. The TREC 2021 and 2022 Fair Ranking tracks employed Attention-Weighted Rank Fairness (AWRF) in conjunction with nDCG, whereas the NTCIR-17 FairWeb-1 task and the NTCIR-18 FairWeb-2 task used Group Fairness and Relevance (GFR). These measures differ significantly in how they quantify group fairness, and the implications of choosing one over the other remain underexplored. In this paper, we use data from the NTCIR FairWeb tasks to compare the two evaluation measures. We first apply AWRF to re-evaluate the submitted runs of the NTCIR-17 FairWeb-1 and the NTCIR-18 FairWeb-2 tasks, and compare them with the original evaluation results in terms of rank correlation. We further investigate the measures' differences in discriminative power, and test their robustness to system bias. Our analysis reveals how metric choice can influence group fairness assessments and provides practical insights for researchers aiming to incorporate group fairness into IR evaluation.
Ensuring fair and accurate recognition of contributors is critical for the health and growth of online communities, yet Expert Finding (EF) methods on Community Question Answering (CQA) platforms often reinforce existing inequalities. This paper investigates the prevalence and amplification of individual biases in EF methods on CQA platforms by introducing a framework that assesses the degree of bias across various attributes in the list of recommended experts. Our work shows that, without corrective measures, automated recommenders risk silencing diverse but less-visible voices, undermining both equity and answer quality. By conducting extensive experiments on multiple Stack Overflow datasets, we analyze how state-of-the-art EF methods reinforce individual bias. Our findings show that EF methods frequently amplify biases, such as prioritizing highly active users over less active participants who may offer higher-quality answers. We also examine how these biases contribute to the recommendation of inaccurate experts, offering a thorough evaluation of the resulting negative impact on the effectiveness and accuracy of CQA platforms. All code and datasets used in this study are publicly available.
We introduce a data-centric approach for mitigating presentation bias in real-time neural query autocomplete systems through the use of synthetic prefixes. These prefixes are generated from complete user queries collected during regular search sessions where autocomplete was not active. This allows us to enrich the training data for learning to rank models with more diverse and less biased examples. This method addresses the inherent bias in engagement signals collected from live query autocomplete interactions, where model suggestions influence user behavior. Our neural ranker is optimized for real-time deployment under strict latency constraints and incorporates a rich set of features, including query popularity, seasonality, fuzzy match scores, and contextual signals such as department affinity, device type, and vertical alignment with previous user queries. To support efficient training, we introduce a task-specific simplification of the listwise loss, reducing computational complexity from O(n²) to O(n) by leveraging the query autocomplete structure of having only one ground-truth selection per prefix. Deployed in a large-scale e-commerce setting, our system demonstrates statistically significant improvements in user engagement, as measured by mean reciprocal rank and related metrics. Our findings show that synthetic prefixes not only improve generalization but also provide a scalable path toward bias mitigation in other low-latency ranking tasks, including related searches and query recommendations.
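As a rough illustration of the complexity reduction claimed above, the following PyTorch sketch shows how a listwise softmax loss collapses to a single cross-entropy term when each prefix has exactly one ground-truth selection, giving one O(n) pass over the n candidates instead of O(n²) pairwise comparisons; the variable names, shapes, and candidate counts are assumptions, not the deployed implementation.

```python
# Minimal sketch, assuming one ground-truth selection per prefix:
# the listwise objective reduces to log-softmax at the clicked candidate.
import torch
import torch.nn.functional as F

def qac_listwise_loss(scores: torch.Tensor, positive_idx: torch.Tensor) -> torch.Tensor:
    """scores: [batch, n_candidates] ranker outputs;
    positive_idx: [batch] index of the single selected suggestion per prefix."""
    return F.cross_entropy(scores, positive_idx)  # one O(n) softmax per prefix

scores = torch.randn(4, 10)                 # 4 prefixes, 10 candidate suggestions each
positive_idx = torch.tensor([3, 0, 7, 2])   # the one ground-truth selection per prefix
loss = qac_listwise_loss(scores, positive_idx)
```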
Using natural language to query databases has become increasingly popular for making data access more intuitive. While Large Language Models (LLMs) offer a scalable approach for Natural Language-to-SQL (NL2SQL) translation, they often struggle with unfamiliar schemas and domain-specific logic in enterprise environments. In this work, we present a data-centric, retrieval-augmented inference framework that improves SQL generation by supplying LLMs with curated examples drawn from prior workloads. We introduce the concept of a Query Capsule (QC), a structured unit that encapsulates a query context (a natural language description of SQL intent) and a SQL template (a query skeleton with typed placeholders). To evaluate the benefit of providing QCs, we propose Merit, a novel metric that captures structural and semantic improvements in SQL generation when QCs are included. Our experiments on the Spider and BIRD benchmarks demonstrate that the inclusion of QCs improves execution accuracy by 4.58% and 5.79%, and Merit score by 5.67% and 8.24%, respectively. Furthermore, a real-world case study on the TPC-H benchmark shows that QC-based retrieval improves execution accuracy by an average factor of 2.63. These findings validate the practical usefulness of QCs and the importance of a data-centric approach for robust enterprise NL2SQL systems.
Task-oriented conversational systems are essential for efficiently addressing diverse user needs, yet their development requires substantial amounts of high-quality conversational data that is challenging and costly to obtain. While large language models (LLMs) have demonstrated potential in generating synthetic conversations, the extent to which these agent-generated interactions can effectively substitute real human conversations remains unclear. This work presents the first systematic comparison between LLM-simulated users and human users in personalized task-oriented conversations. We propose a comprehensive analytical framework encompassing three key aspects (conversation strategy, interaction style, and conversation evaluation) and ten distinct dimensions for evaluating user behaviors, and collect parallel conversational datasets from both human users and LLM agent users across four representative scenarios under identical conditions.
Our analysis reveals significant behavioral differences between the two user types in problem-solving approaches, question broadness, user engagement, context dependency, feedback polarity and promise, language style, and hallucination awareness. We found consistency between agent users and human users along the depth-first versus breadth-first dimensions, as well as the usefulness dimensions. These findings provide critical insights for advancing LLM-based user simulation. Our multi-dimensional taxonomy provides a generalizable framework for analyzing user behavior patterns, offering insights into both LLM agent users and human users. With this work, we provide perspectives on rethinking how user simulation should be used in conversational systems in the future. Code and data are available at https://github.com/wzf2000/RecLLMSim/tree/Human_Vs_Agent.
Genomic question answering often requires complex reasoning and integration across diverse biomedical sources. GeneGPT addressed this challenge by combining domain-specific APIs with OpenAI's code-davinci-002 large language model to enable natural language interaction with genomic databases. However, its reliance on a proprietary model limits scalability, increases operational costs, and raises concerns about data privacy and generalization. In this work, we revisit and reproduce GeneGPT in a pilot study using open source models, including Llama 3.1, Qwen2.5, and Qwen2.5 Coder, within a monolithic architecture; this allows us to identify the limitations of this approach. Building on this foundation, we then develop OpenBioLLM, a modular multi-agent framework that extends GeneGPT by introducing agent specialization for tool routing, query generation, and response validation. This enables coordinated reasoning and role-based task execution. OpenBioLLM matches or outperforms GeneGPT on over 90% of the benchmark tasks, achieving average scores of 0.849 on GeneTuring and 0.830 on GeneHop, while using smaller open-source models without additional fine-tuning or tool-specific pretraining. OpenBioLLM's modular multi-agent design reduces latency by 40–50% across benchmark tasks, significantly improving efficiency without compromising model capability. The results of our comprehensive evaluation highlight the potential of open-source multi-agent systems for genomic question answering. Code and resources are available at https://github.com/ielab/OpenBioLLM.
In real-world recommender systems, user-item interactions are Missing Not At Random (MNAR), as interactions with popular items are more frequently observed than those with less popular ones. Missing observations shift recommendations toward frequently interacted items, which reduces the diversity of the recommendation list. To alleviate this problem, Inverse Propensity Scoring (IPS) is widely used and commonly models propensities based on a power-law function of item interaction frequency. However, we found that such power-law-based correction overly penalizes popular items and harms their recommendation performance. We address this issue by redefining the propensity score to allow broader item recommendation without excessively penalizing popular items. The proposed score is formulated by applying a sigmoid function to the logarithm of the item observation frequency, maintaining the simplicity of power-law scoring while allowing for more flexible adjustment. Furthermore, we incorporate the redefined propensity score into a linear autoencoder model, which tends to favor popular items, and evaluate its effectiveness. Experimental results revealed that our method substantially improves the diversity of items in the recommendation list without sacrificing recommendation accuracy. The source code of our experiments is available on GitHub at https://github.com/cars1015/IPS-LAE.
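A minimal numerical sketch of the contrast described above, under assumed parameterizations (the exact constants and normalization in the paper may differ): a power-law propensity keeps shrinking as popularity grows, so its inverse weight penalizes popular items without bound, whereas a sigmoid of the log observation frequency saturates.

```python
# Illustrative comparison of propensity formulations (parameters are assumptions).
import numpy as np

def power_law_propensity(freq: np.ndarray, gamma: float = 0.5) -> np.ndarray:
    """Conventional power-law propensity based on item interaction frequency."""
    return (freq / freq.max()) ** gamma

def sigmoid_log_propensity(freq: np.ndarray, a: float = 1.0, b: float = 0.0) -> np.ndarray:
    """Sigmoid applied to the logarithm of the observation frequency."""
    z = a * np.log(freq) + b
    return 1.0 / (1.0 + np.exp(-z))  # stays in (0, 1) and saturates for popular items

freq = np.array([1, 10, 100, 1000, 10000], dtype=float)
print(1 / power_law_propensity(freq))    # IPS weights grow without bound
print(1 / sigmoid_log_propensity(freq))  # IPS weights flatten out for popular items
```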
Multi-hop question answering (QA) models excel at decomposing complex queries into sequential reasoning steps, yet they remain vulnerable to subtly flawed inference chains that appear reasonable but are factually incorrect. To quantify and address this weakness, we present FalseCoTQA, an adversarial benchmark that injects knowledge-grounded false reasoning into retrieval-augmented contexts. Unlike prior methods that merely tweak surface text, FalseCoTQA leverages a domain-agnostic knowledge graph to systematically replace entities to construct semantically coherent yet incorrect chains of thought on top of standard multi-hop datasets (HotpotQA and MuSiQue). By evaluating state-of-the-art language models on this benchmark, we observe dramatic drops in answer accuracy, highlighting their tendency to follow deceptive reasoning without verifying factual consistency. We expect the proposed benchmark to contribute to the evaluation and improvement of the robustness and reliability of language models in multi-hop question answering.
Math information retrieval systems aim to find documents relevant to the user's mathematical information needs. Mathematical notation often involves symbols with multiple meanings, making it difficult for search engines to accurately interpret user intent. While general search engines struggle with ambiguous and incomplete queries, the complexity increases in the mathematical domain due to this notational ambiguity and the technical nature of mathematics, requiring precise interpretation. To address these challenges, query rewriting techniques are crucial in math information retrieval systems. This paper explores math query rewriting approaches leveraging the contextual understanding capabilities of large language models. The techniques include clarification, simplification, summarization, and using large language models to generate potential answers, which are then used as search queries. The experiments on the ARQMath test collections showed that rewritten math queries with generated answers significantly improve effectiveness across all test collections. Furthermore, each technique can be useful depending on how the original math question is formulated.
With the deepening of research into LLMs, it is the right time to understand the similarities and distinctions between LLMs and human users. This talk addresses several questions from a user-centric viewpoint in information access tasks: How can we evaluate the performance of large models, and what is their efficacy? To what extent do LLMs' conversational behaviors differ from those of humans in IR tasks? How does their capacity for test-time learning from conversational reasoning experiences compare to that of humans? Some of our recent explorations and findings on these questions will also be presented. We hope that discussions on the related topics will offer some new perspectives and inspire future research into the behavior and reasoning mechanisms of LLMs in information access tasks.
In information retrieval, large language models (LLMs) have demonstrated remarkable potential in text reranking tasks by leveraging their sophisticated natural language understanding and advanced reasoning capabilities. However, conventional supervised fine-tuning approaches for specializing LLMs in ranking tasks often lead to significant degradation of the models' general-purpose abilities. To address this fundamental challenge, this paper presents a novel methodology that strategically combines Chain-of-Thought (CoT) prompting techniques with an innovative two-stage training pipeline consisting of Supervised Fine-Tuning followed by Ranking Preference Optimization (SFT-RPO). The Chain-of-Thought prompting component encourages models to explicitly articulate their reasoning process during ranking decisions, creating a transparent pathway from query-document analysis to final ranking scores while maintaining analytical capabilities throughout fine-tuning. Extensive experimental evaluations on the TREC Deep Learning datasets demonstrate that our proposed method achieves superior performance compared to existing state-of-the-art models, including RankZephyr, showing consistent improvements across multiple evaluation metrics such as normalized Discounted Cumulative Gain (nDCG). Most significantly, comprehensive assessments on the Massive Multitask Language Understanding (MMLU) benchmark reveal that our method successfully maintains robust performance across diverse reasoning tasks, providing strong empirical evidence for effective retention of general-purpose capabilities through strategic fine-tuning while achieving specialized performance improvements in text reranking. Code is available at https://github.com/JamesL404/RaCT/tree/main.
Neural ranking models have shown outstanding performance across a variety of tasks, such as document retrieval, re-ranking, question answering and conversational retrieval. However, the inner decision process of these models remains largely unclear, especially as models increase in size. Most interpretability approaches, such as probing, focus on correlational insights rather than establishing causal relationships. The paper 'Axiomatic Causal Interventions for Reverse Engineering Relevance Computation in Neural Retrieval Models' by Chen et al. addresses this gap by introducing a framework for activation patching, a causal interpretability method, in the information retrieval domain, offering insights into how neural retrieval models compute document relevance. The study demonstrates that neural ranking models not only capture term-frequency information, but also that these representations can be localized to specific components of the model, such as individual attention heads or layers. This paper aims to reproduce the findings by Chen et al. and to further explore the presence of pre-defined retrieval axioms in neural IR models. We validate the main claims made by Chen et al., and extend the framework to include an additional term-frequency axiom, which states that the impact of increasing query term frequency on document ranking diminishes as the frequency becomes higher. We successfully identify a group of attention heads that encode this axiom and analyze their behavior to give insight into the inner decision-making process of neural ranking models.
Current neural re-rankers often struggle with complex information needs and long, content-rich documents. The fundamental issue is not computational; it is intelligent content selection: identifying what matters in lengthy, multi-faceted texts. While humans naturally anchor their understanding around key entities and concepts, neural models process text within rigid token windows, treating all interactions as equally important and missing critical semantic signals. We introduce REGENT, a neural re-ranking model that mimics human-like understanding by using entities as a ''semantic skeleton'' to guide attention. REGENT integrates relevance guidance directly into the attention mechanism, combining fine-grained lexical matching with high-level semantic reasoning. This relevance-guided attention enables the model to focus on conceptually important content while maintaining sensitivity to precise term matches. REGENT achieves new state-of-the-art performance on three challenging datasets, providing up to 108% improvement over BM25 and consistently outperforming strong baselines including ColBERT and RankVicuna. To our knowledge, this is the first work to successfully integrate entity semantics directly into neural attention, establishing a new paradigm for entity-aware information retrieval.
While neural ranking models (NRMs) have shown high effectiveness, they remain susceptible to adversarial manipulation. In this work, we introduce Few-Shot Adversarial Prompting (FSAP), a novel black-box attack framework that leverages the in-context learning capabilities of Large Language Models (LLMs) to generate high-ranking adversarial documents. Unlike previous approaches that rely on token-level perturbations or manual rewriting of existing documents, FSAP formulates adversarial attacks entirely through few-shot prompting, requiring no gradient access or internal model instrumentation. By conditioning the LLM on a small support set of previously observed harmful examples, FSAP synthesizes grammatically fluent and topically coherent documents that subtly embed false or misleading information and rank competitively against authentic content. We instantiate FSAP in two modes: FSAP-IntraQ, which leverages harmful examples from the same query to enhance topic fidelity, and FSAP-InterQ, which enables broader generalization by transferring adversarial patterns across unrelated queries. Our experiments on the TREC 2020 and 2021 Health Misinformation Tracks, using four diverse neural ranking models, reveal that FSAP-generated documents consistently outrank credible, factually accurate documents. Furthermore, our analysis demonstrates that these adversarial outputs exhibit strong stance alignment and low detectability, posing a realistic and scalable threat to neural retrieval systems. FSAP also effectively generalizes across both proprietary and open-source LLMs.
We propose a suite of similarity measures for comparing two ranked lists which may contain nonoverlapping items and may even differ in size: we collectively call them SimPD (Similarity based on Promotion and Demotion magnitudes). As the name suggests, SimPD directly quantifies how many ranks each overlapping item has moved up or down in the list in a simple and intuitive manner. Like Rank-Biased Overlap (RBO), a version of SimPD (called SimPD-tA) possesses top-heaviness; unlike RBO, SimPD gives the maximum possible score of 1 if the two ranked lists are identical. Furthermore, a version of SimPD (called SimPD-F) focusses on a subset of the overlapping items when quantifying item promotions and demotions: for example, we can quantify the promotion/demotion magnitudes of overlapping relevant documents given two Search Engine Result Pages (SERPs). We demonstrate the properties of SimPD through theorem proofs as well as demonstrations with web search task data from NTCIR. More specifically, we show how SimPD can be used in a document-level search engine reproducibility study, and for comparing SERPs before and after search result diversification. In both experiments, we compare SimPD with RBO as well as new measures that we define called KTbU (Kendall's τ-b Union) and SRU (Spearman's ρ Union) to highlight the uniqueness of SimPD.
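The exact SimPD definitions are given in the paper; the toy Python sketch below only conveys the basic intuition of scoring overlapping items by their promotion/demotion magnitudes, with the convention that identical lists score 1. The normalization used here, and the absence of top-heaviness, are simplifying assumptions rather than the paper's formulation.

```python
def promotion_demotion_similarity(list_a, list_b):
    """Toy rank-shift-based similarity; not the paper's exact SimPD formula."""
    ranks_a = {doc: r for r, doc in enumerate(list_a)}
    ranks_b = {doc: r for r, doc in enumerate(list_b)}
    overlap = set(ranks_a) & set(ranks_b)
    if not overlap:
        return 0.0
    max_shift = max(len(list_a), len(list_b)) - 1
    if max_shift == 0:
        return 1.0  # both lists contain a single, identical item
    # average normalized promotion/demotion magnitude over overlapping items
    shifts = [abs(ranks_a[d] - ranks_b[d]) / max_shift for d in overlap]
    return 1.0 - sum(shifts) / len(shifts)

print(promotion_demotion_similarity(["a", "b", "c"], ["a", "b", "c"]))  # 1.0 for identical lists
print(promotion_demotion_similarity(["a", "b", "c"], ["c", "b", "a"]))  # lower when items move
```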
In this paper, we propose the task of Ranked Video Moment Retrieval (RVMR) to locate a ranked list of matching moments from a collection of videos, through queries in natural language. Although a few related tasks have been proposed and studied by CV, NLP, and IR communities, RVMR is the task that best reflects the practical setting of moment search. To facilitate research in RVMR, we develop the TVR-Ranking dataset, based on the raw videos and existing moment annotations provided in the TVR dataset. Our key contribution is the manual annotation of relevance levels for 94,442 query-moment pairs. We then develop the NDCG@K, IoU ≥ μ evaluation metric for this new task and conduct experiments to evaluate three baseline models. Our experiments show that the new RVMR task brings new challenges to existing models and we believe this new dataset contributes to the research on multi-modality search. The dataset is available at https://github.com/Ranking-VMR/TVR-Ranking.
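As an illustration of how an NDCG@K, IoU ≥ μ style metric might be computed, the sketch below credits a retrieved moment with an annotated moment's relevance only when their temporal IoU reaches the threshold μ; the data layout and the handling of multiple matches are simplifying assumptions, not the official evaluation script for TVR-Ranking.

```python
# Hedged sketch of an IoU-thresholded NDCG@K for ranked video moment retrieval.
import math

def temporal_iou(a, b):
    """IoU of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def ndcg_at_k(ranked, annotated, k=10, mu=0.5):
    """ranked: [(video_id, start, end)] predictions;
    annotated: {video_id: [(start, end, relevance)]} graded ground truth."""
    gains = []
    for vid, s, e in ranked[:k]:
        # a prediction earns a gain only if it overlaps an annotation with IoU >= mu
        rel = max((r for (gs, ge, r) in annotated.get(vid, [])
                   if temporal_iou((s, e), (gs, ge)) >= mu), default=0)
        gains.append(rel)
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted((r for ms in annotated.values() for (_, _, r) in ms), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```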
Cross-modal retrieval plays a fundamental role in bridging distinct information sources, such as text, images, and videos. Different from the traditional methods that predominantly rely on discriminative matching via cross-attention or joint embedding spaces, generative cross-modal retrieval has recently emerged as a new paradigm. Despite the improvements achieved by recent studies, there are still many open questions. For instance, under the conventional setting of representing each candidate item by a unique identifier and expanding the prefix one token at a time in a greedy manner during the constrained beam search process, once the prefix of a relevant item's identifier is pruned, it becomes impossible for that item to appear in the final result list, leading to the risk of getting stuck in a local optimum. In this paper, we develop a combination-based framework for generative cross-modal retrieval. Specifically, we not only explore the effectiveness of imposing dual identifiers during the constrained beam search process, but also investigate the benefits of combining different retrieval strategies so as to mitigate the information loss caused by the discrete representations. Based on two benchmark collections, our extensive empirical experiments reveal that: (1) Compared with the state-of-the-art generative text-image retrieval method, our proposed approach based on dual identifiers achieves substantially improved performance. Although generative retrieval with discrete identifiers offers higher efficiency, it still falls significantly short of dense embedding-based retrieval in terms of performance. To overcome this limitation, we propose a hybrid strategy that performs initial generative retrieval followed by reranking the top-k candidates using dense embeddings, resulting in notable improvements in both retrieval performance and efficiency. (2) Factors such as the choice of base LLM, the number of top candidate items to rerank, the beam size, and the way ranking strategies are combined significantly influence retrieval performance. Careful examination of these factors is highly recommended in the development of generative text-image retrieval methods.
Recent dense retrievers increasingly leverage the robust text understanding capabilities of Large Language Models (LLMs), encoding queries and documents into a shared embedding space for effective retrieval. However, most existing methods represent each document with a single embedding, which is less effective at capturing its multifaceted semantics and thereby limits matching accuracy. In this paper, we propose Deliberate Thinking based Dense Retriever (Debater), a novel approach that enhances document representations by incorporating a step-by-step thinking process. Debater introduces a Chain-of-Deliberation mechanism, which iteratively refines document embeddings through a continuous chain of thought. To integrate information from various thinking steps, Debater further employs a Self Distillation mechanism that identifies and fuses the most informative steps into a unified embedding. Experimental results show that Debater significantly outperforms existing methods across several retrieval benchmarks, demonstrating superior accuracy and robustness. All codes and datasets are available at https://github.com/OpenBMB/DEBATER.
State-of-the-art approximate nearest neighbor (ANN) methods like HNSW and LADR use document-document proximity graphs (also known as corpus graphs) to identify relevant documents efficiently. Building a complete corpus graph, although done offline, has time complexity quadratic in the number of documents, which is a major hurdle when scaling these methods. Graph approximations are popular ways to reduce the computational cost of building such corpus graphs. However, approximations come with a cost, namely, a lower quality of corpus graphs. Hence, there is a practical need to understand the tradeoffs between a corpus graph's quality and its effectiveness when used with various ANN methods; in other words, how 'approximate' can a corpus graph be while maintaining strong retrieval effectiveness? We construct approximate (i.e., poorer quality) corpus graphs using various methods and present extensive experiments that analyze the robustness and performance of popular ANN methods on these graphs. Our analysis is performed on multiple datasets, with different parameters and various poor graph simulation strategies. We also analyze different graph traversal approaches for robust and efficient retrieval across graphs of poor quality. We conclude by addressing the utility of these approaches in billion-scale, practical scenarios by optimizing the graph construction and graph traversal stages. We show that robust ANN methods like Adaptive LADR achieve statistically equivalent performance on poor-quality graphs while saving 33% of graph construction time.
Video moment search, the process of finding relevant moments in a video corpus to match a user's query, is crucial for various applications. Existing solutions, however, often assume a single perfect matching moment, struggle with inefficient inference, and have limitations with hour-long videos. This paper introduces a flexible and scalable framework for retrieving a ranked list of moments from a collection of videos of any length to match a text query, a task termed Ranked Video Moment Retrieval (RVMR). Our framework, called Segment-Proposal-Ranking (SPR), simplifies the search process into three independent stages: segment retrieval, proposal generation, and moment refinement with re-ranking. Specifically, videos are divided into equal-length segments with precomputed embeddings indexed offline, allowing efficient retrieval regardless of video length. For scalable online retrieval, both segments and queries are projected into a shared feature space to enable approximate nearest neighbor (ANN) search. Retrieved segments are then merged into coarse-grained moment proposals. A refinement and re-ranking module then reorders the coarse-grained proposals and adjusts their timestamps. Evaluations on the TVR-Ranking dataset demonstrate that our framework achieves state-of-the-art performance with significant reductions in computational cost and processing time. The flexible design also allows for independent improvements to each stage, making SPR highly adaptable for large-scale applications. Code is available at https://github.com/Ranking-VMR/SPR
Document-at-a-time query processing can be accelerated through the use of dynamic pruning mechanisms. In this empirical study we measure query time as a function of three numeric and three categorical facets, and infer relationships that allow models of computation time to be established. Using different-sized subsets of three collections, three retrieval models, and three pruning techniques, we quantify the way in which all of collection size, number of documents retrieved, and query length affect query execution times. Despite variations across pruning mechanisms, we find that document retrieval time is linear in the combination of collection size and retrieval depth, across all categorical dimensions. Our results allow selection of query processing techniques for specific search tasks, with the choice influenced by collection size, query length, and number of documents retrieved.
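To illustrate the kind of model such measurements could support, here is a small least-squares sketch under the assumption that latency grows linearly with collection size times retrieval depth; the timing numbers are hypothetical placeholders, not measurements from the study, and the functional form is only one plausible reading of the relationship.

```python
# Illustrative fit of query latency t ~ a * (N * k) + b (form is an assumption).
import numpy as np

# Hypothetical measurements: (collection size N, retrieval depth k, time in ms).
obs = np.array([
    [1e6, 10, 12.0], [1e6, 1000, 30.0],
    [1e7, 10, 95.0], [1e7, 1000, 260.0],
])
X = np.column_stack([obs[:, 0] * obs[:, 1], np.ones(len(obs))])  # [N*k, 1]
coef, *_ = np.linalg.lstsq(X, obs[:, 2], rcond=None)
a, b = coef
print(f"t ~= {a:.3g} * (N*k) + {b:.3g} ms")
```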
Designing document identifiers (docids) that carry rich semantic information while maintaining tractable search spaces is an important challenge in generative retrieval (GR). Popular codebook methods address this by building a hierarchical semantic tree and constraining generation to its child nodes, yet their numeric identifiers cannot leverage the large language model's pretrained natural language understanding. Conversely, using text as docid provides more semantic expressivity but inflates the decoding space, making the system brittle to early-step errors. To resolve this trade-off, we propose C2T-ID: (i) first, construct semantic numerical docids via hierarchical clustering; (ii) then, extract high-frequency metadata keywords and iteratively replace each numeric label with its cluster's top-K keywords; and (iii) optionally, apply a two-level semantic smoothing step to further enhance the fluency of C2T-ID. Experiments on Natural Questions and Taobao's product search demonstrate that C2T-ID significantly outperforms atomic, semantic codebook, and pure-text docid baselines, confirming its effectiveness in balancing semantic expressiveness with search space constraints.
The quadratic complexity of standard self-attention severely limits the application of Transformer-based models to long-context tasks. While efficient Transformer variants exist, they often require architectural changes and costly pre-training from scratch. To circumvent this, we propose ScaleFormer (Span Representation Cumulation for Long-Context Transformer)—a simple and effective plug-and-play framework that adapts off-the-shelf pre-trained encoder-decoder models to process long sequences without requiring architectural modifications. Our approach segments long inputs into overlapping chunks and generates a compressed, context-aware representation for the decoder. The core of our method is a novel, parameter-free fusion mechanism that endows each chunk's representation with structural awareness of its position within the document. It achieves this by enriching each chunk's boundary representations with cumulative context vectors from all preceding and succeeding chunks. This strategy provides the model with a strong signal of the document's narrative flow, achieves linear complexity, and enables pre-trained models to reason effectively over long-form text. Experiments on long-document summarization show that our method is highly competitive with and often outperforms state-of-the-art approaches without requiring architectural modifications or external retrieval mechanisms.
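The following PyTorch sketch conveys one way a parameter-free cumulative fusion over chunk representations could look, under the assumption of a single pooled vector per chunk; the paper operates on boundary representations of overlapping chunks, so the shapes and the exact fusion rule here are illustrative rather than the actual ScaleFormer implementation.

```python
# Minimal sketch: enrich each chunk's representation with cumulative context
# vectors from all preceding and all succeeding chunks (no extra parameters).
import torch

def cumulative_fusion(chunk_reps: torch.Tensor) -> torch.Tensor:
    """chunk_reps: [num_chunks, hidden], one pooled representation per chunk."""
    prefix = torch.cumsum(chunk_reps, dim=0) - chunk_reps  # sum of preceding chunks
    suffix = torch.flip(torch.cumsum(torch.flip(chunk_reps, [0]), dim=0), [0]) - chunk_reps  # succeeding
    return chunk_reps + prefix + suffix  # position-aware, parameter-free fusion

chunks = torch.randn(6, 768)  # six overlapping chunks of one long document
fused = cumulative_fusion(chunks)  # fed to the decoder as a compressed representation
```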
Recommender systems frequently encounter data sparsity issues, particularly when addressing cold-start scenarios involving new users or items. Multi-source cross-domain recommendation (CDR) addresses these challenges by transferring valuable knowledge from multiple source domains to enhance recommendations in a target domain. However, existing reinforcement learning (RL)-based CDR methods typically rely on a single-agent framework, leading to negative transfer issues caused by inconsistent domain contributions and inherent distributional discrepancies among source domains. To overcome these limitations, MARCO, a Multi-Agent Reinforcement Learning-based Cross-Domain recommendation framework, is proposed. It leverages cooperative multi-agent reinforcement learning, where each agent is dedicated to estimating the contribution from an individual source domain, effectively managing credit assignment and mitigating negative transfer. In addition, an entropy-based action diversity penalty is introduced to enhance policy expressiveness and stabilize training by encouraging diverse agents' joint actions. Extensive experiments across four benchmark datasets demonstrate MARCO's superior performance over state-of-the-art methods, highlighting its robustness and strong generalization capabilities. The code is at https://github.com/xiewilliams/MARCO.
Recent studies have demonstrated the vulnerability of sequential recommender systems to Model Extraction Attacks (MEAs). MEAs generally exploit user sequential data along with its recommendations to train a surrogate model that mimics the behavior of the original system, enabling unauthorized deployments and threatening privacy and security. However, in real-world scenarios, such data is inaccessible due to authorization restrictions. In the data-free setting, MEAs usually rely on random sampling for data selection, which leads to misaligned synthetic and real-world distributions, greatly limiting the effectiveness of the attack. To overcome this limitation, we propose LLM4MEA, a novel framework based on an LLM-driven agent that interacts continuously with the target recommender system to synthesize training data. Specifically, the agent analyzes historical sequences to understand user behavior and selects items from recommendations with consistent preferences. It then autoregressively extends the sequence, creating synthetic data for MEA. Furthermore, to enhance efficiency and ensure consistent performance, we introduce the Memory Compression module to reduce sequential histories, and the Preference Stabilization module to guide the agent's decisions. Extensive experiments demonstrate that LLM4MEA significantly outperforms existing approaches in data quality and attack performance. We also assess how the target system's hyperparameters affect MEA and suggest a simple defense to reduce these risks. The aim of this work is to raise awareness of the security and privacy risks of MEAs in recommender systems.
Research and development on conversational recommender systems (CRSs) critically depends on sound and reliable evaluation methodologies. However, the interactive nature of these systems poses significant challenges for automatic evaluation. This paper critically examines current evaluation practices and identifies two key limitations: the over-reliance on static test collections and the inadequacy of existing evaluation metrics. To substantiate this critique, we analyze real user interactions with nine existing CRSs and demonstrate a striking disconnect between self-reported user satisfaction and performance scores reported in prior literature. To address these limitations, this work explores the potential of user simulation to generate dynamic interaction data, offering a departure from static datasets. Furthermore, we propose novel evaluation metrics, based on a general reward/cost framework, designed to better align with real user satisfaction. Our analysis of different simulation approaches provides valuable insights into their effectiveness and reveals promising initial results, showing improved correlation with system rankings compared to human evaluation. While these findings indicate a significant step forward in CRS evaluation, we also identify areas for future research and refinement in both simulation techniques and evaluation metrics.
Bundle recommender systems typically learn only from existing bundles, but obtaining large-scale, high-quality bundle datasets remains a challenge, especially for platforms newly adopting bundle services. Bundle construction is the task of automatically selecting a set of compatible items to form a coherent bundle, a vital step before making recommendations on bundle-aware platforms. Groundbreaking work on bundle construction, such as CLHE, relies solely on user-item interactions and self-attention modules to learn item/bundle representations. These techniques fall short of the standards for coherent bundles in real-world applications, where the relations among items' semantic information should be considered more thoroughly. To address these challenges, we explicitly leverage category-wise information and employ cross-modal fusion to enhance item representations. By doing so, we propose Caro: Dual-Enhanced Item Representation for Bundle Construction via Category-Wise and Cross-Modality Learning. Caro captures the inherent relationships between items within analogous categories, improving bundle coherence. It comprises three main components: (1) cross-modality enhanced item representation, (2) category-enhanced item representation, and (3) bundle contrastive learning. Extensive experiments and detailed analyses using multiple real-world datasets demonstrate that our method outperforms existing state-of-the-art techniques and provides valuable insight into the bundle construction problem. Notably, Caro achieves a 5-8% higher Recall@20 than the strongest baseline, underscoring its performance gains through dual category-wise and cross-modal enhancements. Our repository is available at https://github.com/L2R-UET/CaRo.
Long-context inputs in large language models (LLMs) often suffer from the ''lost in the middle'' problem, where critical information becomes diluted or ignored due to excessive length. Context compression methods aim to address this by reducing input size, but existing approaches struggle with balancing information preservation and compression efficiency. We propose Adaptive Task-Aware Compressor (ATACompressor), which dynamically adjusts compression based on the specific requirements of the task. ATACompressor employs a selective encoder that compresses only the task-relevant portions of long contexts, ensuring that essential information is preserved while reducing unnecessary content. Its adaptive allocation controller perceives the length of relevant content and adjusts the compression rate accordingly, optimizing resource utilization. We evaluate ATACompressor on three QA datasets—HotpotQA, MSMARCO, and SQUAD—showing that it outperforms existing methods in terms of both compression efficiency and task performance. Our approach provides a scalable solution for long-context processing in LLMs. Furthermore, we perform a range of ablation studies and analysis experiments to gain deeper insights into the key components of ATACompressor. Our code is available at https://github.com/Cocobalt/ATACompressor.git.
Personalized Conversational Information Retrieval (CIR) has seen rapid progress in recent years, driven by the development of Large Language Models (LLMs). Personalized CIR aims to enhance document retrieval by leveraging user-specific information, such as preferences, knowledge, or constraints, to tailor responses to individual needs. A key resource developed for this task is the TREC iKAT 2023 dataset, designed to evaluate the integration of personalization into CIR pipelines. Building on this resource, Mo et al. explored several strategies for incorporating Personal Textual Knowledge Bases (PTKB) into LLM-based query reformulation. Their findings suggested that personalization from PTKB could be detrimental and that human annotations were often noisy. However, these conclusions were based on single-run experiments using the commercial GPT-3.5 Turbo model, raising concerns about output variability and repeatability. In this reproducibility study, we rigorously reproduce and extend their work, with a focus on LLM output variability and model generalization. We apply the original methods to the newly released TREC iKAT 2024 dataset, and evaluate a diverse range of models, including Llama (1B to 70B), Qwen-7B, and closed-source models like GPT-3.5 and GPT-4o-mini. Our results show that human-selected PTKBs consistently enhance retrieval performance, while LLM-based selection methods do not reliably outperform manual choices. We further compare variance across datasets and observe substantially higher variability on iKAT than on CAsT, highlighting the challenges of evaluating personalized CIR. Notably, recall-oriented metrics exhibit lower variance than precision-oriented ones, a critical insight for first-stage retrievers, not addressed in the original study. Finally, we underscore the need for multi-run evaluations and variance reporting when assessing LLM-based CIR systems, especially in dense and sparse retrieval or in-context learning settings. By broadening the scope of evaluation across models, datasets, and metrics, our study contributes to more robust and generalizable practices for personalized CIR.
This study investigates the impact of including LLM-rewritten documents and LLM-expanded queries in training data on the performance and ranking behavior of retrieval models for ad-hoc retrieval tasks. Specifically, we train models using various training datasets that combine human-written documents with LLM-rewritten ones, and human-generated queries with LLM-expanded ones. Hereafter, we refer to the latter types as LLM-modified documents and LLM-modified queries. This allows us to examine: (1) the performance difference between models trained on LLM-modified documents and those trained on human-generated documents; (2) the tendency of models trained with LLM-modified documents to rank specific document types higher; (3) the retrieval performance on human-generated queries of models trained using LLM-modified queries; and (4) the ranking preference of models trained with LLM-modified queries. Experimental results show that training with LLM-modified documents generally yields retrieval performance comparable to models trained solely on human-generated documents. However, regarding ranking behavior, models trained on LLM-modified documents exhibit a clear tendency to rank LLM-modified documents higher within mixed corpora. When training with LLM-modified queries, performance on human-generated queries degrades, possibly due to a mismatch in query distributions. We further find that this source bias is similarly introduced when training with LLM-modified queries.
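One simple way to quantify the ranking preference described here is to measure how often LLM-modified documents occupy top-k positions for queries over a mixed corpus, relative to their share of that corpus; the sketch below, including its field names and the "llm" label, is purely illustrative.

```python
def source_bias_at_k(rankings: list[list[str]], doc_source: dict[str, str],
                     k: int = 10) -> float:
    """Average fraction of top-k slots filled by LLM-modified documents,
    over a set of queries issued against a mixed human/LLM corpus.

    rankings: one ranked list of document ids per query.
    doc_source: maps each document id to "human" or "llm".
    """
    fractions = []
    for ranked_docs in rankings:
        top_k = ranked_docs[:k]
        llm_hits = sum(doc_source[d] == "llm" for d in top_k)
        fractions.append(llm_hits / (len(top_k) or 1))
    return sum(fractions) / len(fractions)

# If the corpus is 50% LLM-modified but this value sits well above 0.5,
# the ranker prefers LLM-modified documents.
```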
Text clustering serves as a fundamental technique for organizing and interpreting unstructured textual data, particularly in contexts where manual annotation is prohibitively costly. With the rapid advancement of Large Language Models (LLMs) and their demonstrated effectiveness across a broad spectrum of NLP tasks, an emerging body of research has begun to explore their potential in the domain of text clustering. However, existing LLM-based approaches still rely on fine-tuned embedding models and sophisticated similarity metrics, rendering them computationally intensive and necessitating domain-specific adaptation. To address these limitations, we propose a novel framework that reframes text clustering as a classification task by harnessing the in-context learning capabilities of LLMs. Our framework eliminates the need for fine-tuning embedding models or designing intricate clustering algorithms. It comprises two key steps: first, the LLM is prompted to generate a set of candidate labels from the dataset and to merge semantically similar ones; second, it assigns the most appropriate label to each text sample. By leveraging the advanced natural language understanding and generalization capabilities of LLMs, the proposed approach enables effective clustering with minimal human intervention. Experimental results on diverse datasets demonstrate that our framework achieves performance comparable or superior to state-of-the-art embedding-based clustering techniques, while significantly reducing computational complexity and resource requirements. These findings underscore the transformative potential of LLMs in simplifying and enhancing text clustering tasks. Our code is publicly available at https://github.com/ECNU-Text-Computing/Text-Clustering-via-LLM.
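A minimal sketch of this two-step scheme is shown below, written against a generic `chat(prompt)` placeholder because the abstract does not commit to a specific LLM API; the prompt wording and JSON output format are assumptions.

```python
import json

def chat(prompt: str) -> str:
    """Placeholder for an LLM call (e.g. an OpenAI- or vLLM-backed client)."""
    raise NotImplementedError

def generate_labels(samples: list[str], n_shown: int = 50) -> list[str]:
    # Step 1: propose candidate labels from a sample of the dataset,
    # asking the model to merge semantically similar labels.
    shown = "\n".join(f"- {s}" for s in samples[:n_shown])
    prompt = (
        "Here are text samples from one dataset:\n"
        f"{shown}\n"
        "Propose a short list of cluster labels, merging semantically similar "
        "ones. Answer as a JSON list of strings."
    )
    return json.loads(chat(prompt))

def assign_label(text: str, labels: list[str]) -> str:
    # Step 2: classify each sample into one of the merged labels.
    prompt = (
        f"Labels: {', '.join(labels)}\n"
        f"Text: {text}\n"
        "Answer with exactly one label from the list."
    )
    return chat(prompt).strip()
```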
The modern information environment (MIE) is increasingly complex, shaped by a wide range of techniques designed to satisfy users' information needs. Information seeking (IS) models are effective mechanisms for characterizing user-system interactions. However, conceptualizing a model that fully captures the MIE landscape poses a challenge, raising the question: does such a model exist? To address this, we propose the Information Seeking in Modern Information Environments (ISMIE) framework as a fundamental step. ISMIE conceptualizes the information seeking process (ISP) via three key concepts: Components (e.g., Information Seeker), Intervening Variables (e.g., Interactive Variables), and Activities (e.g., Acquiring). Using ISMIE's concepts and a case study of a common scenario, misinformation dissemination, we analyze six existing IS and information retrieval (IR) models to illustrate their limitations and the necessity of ISMIE. We then show how ISMIE serves as an actionable framework for both characterization and experimental design. We characterize three pressing issues and then outline two research blueprints: a user-centric, industry-driven experimental design for the authenticity and trust crisis posed by AI-generated content, and a system-oriented, academic-driven design for tackling dopamine-driven content consumption. Our framework offers a foundation for developing IS and IR models that advance our understanding of human interactions and system design in MIEs.
Personalized AI agents are becoming central to modern information retrieval, yet most evaluation methodologies remain static, relying on fixed benchmarks and one-off metrics that fail to reflect how users' needs evolve over time. These limitations hinder our ability to assess whether agents can meaningfully adapt to individuals across dynamic, longitudinal interactions. In this perspective paper, we propose a conceptual lens for rethinking evaluation in adaptive personalization, shifting the focus from static performance snapshots to interaction-aware, evolving assessments. We organize this lens around three core components: (1) persona-based user simulation with temporally evolving preference models; (2) structured elicitation protocols inspired by reference interviews to extract preferences in context; and (3) adaptation-aware evaluation mechanisms that measure how agent behavior improves across sessions and tasks. While recent work has embraced LLM-driven user simulation, we situate this practice within a broader paradigm for evaluating agents over time. To illustrate our ideas, we conduct a case study in e-commerce search using the PersonalWAB dataset. Beyond presenting a framework, our work lays a conceptual foundation for understanding and evaluating personalization as a continuous, user-centric endeavor.
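As a toy illustration of the first component, the sketch below models a persona whose topical preference weights drift between sessions via a random walk; the topics, drift rate, and class structure are hypothetical stand-ins for richer temporally evolving preference models.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Persona:
    """Toy persona whose preference weights drift across sessions."""
    preferences: dict[str, float]
    drift: float = 0.05
    history: list[dict[str, float]] = field(default_factory=list)

    def next_session(self) -> dict[str, float]:
        # A Gaussian random walk stands in for a learned model of how
        # a user's interests evolve over time.
        self.preferences = {
            topic: max(0.0, weight + random.gauss(0.0, self.drift))
            for topic, weight in self.preferences.items()
        }
        self.history.append(dict(self.preferences))
        return self.preferences

persona = Persona({"running shoes": 0.8, "smartwatches": 0.3})
for _ in range(3):
    print(persona.next_session())
```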
The Text-to-Table task aims to generate format-free tables that convey key information from unstructured text without designated headers. The challenge lies not only in extracting the key information but also in creating appropriate table headers and accurately populating the table with the extracted information. Meanwhile, large language models (LLMs) have shown great success as multilingual general task solvers, in principle enabling text-to-table capabilities in many languages. However, current text-to-table benchmarks are all English-centric, which limits research in non-English languages. In this paper, we propose CT-Bench, a Chinese text-to-table dataset with 86.6K samples, to benchmark LLMs on this task. Our analysis of the current English text-to-table benchmarks highlights limitations in data diversity and data hallucination. Motivated by this, CT-Bench draws on a popular Chinese multidisciplinary online encyclopedia as its source, covering 28 domains to ensure data diversity. To minimize data hallucination, we first train an LLM to filter out samples with hallucination and then employ human annotators to clean the validation and testing sets. Using CT-Bench, we evaluate the performance of open-source and closed-source LLMs. Our results reveal that zero-shot LLMs, including the GPT-4 and DeepSeek families, still show a significant performance gap compared to human judgment. However, after fine-tuning on the CT-Bench training data, open-source LLMs can significantly improve their text-to-table abilities, outperforming the GPT-4 and DeepSeek families by a large margin.
Code search is an important information retrieval application. Benefits of better code search include faster onboarding of new developers, reduced software maintenance effort, and easier understanding of large repositories. Despite improvements in search algorithms and search benchmarks, the domain of code search has lagged behind. One reason is the high cost of human annotation for code queries and answers. While humans may annotate search results in general text QA systems, code annotation requires specialized knowledge of a programming language (PL) as well as domain-specific software engineering knowledge. In this work, we study the use of Large Language Models (LLMs) to retrieve code at the level of functions and to generate annotations for code search results. We compare the impact of the retriever representation (sparse vs. semantic), programming language, and LLM by comparing human annotations across several popular languages (C, Java, JavaScript, Go, and Python). We focus on repositories that implement common data structures likely to be implemented in any PL. For the same human annotations, we compare several LLM-as-a-Judge models to evaluate programming-language and other affinities between LLMs. We find that the chosen retriever and PL exhibit affinities that can be leveraged to improve alignment of human and AI relevance determinations, with significant performance implications. We also find differences in representation (sparse vs. semantic) across PLs that impact the alignment of human and AI relevance determinations. We propose using transpilers to bootstrap scalable code search benchmark datasets in other PLs and, in a case study, demonstrate that human-AI relevance agreement rates largely match the (worst-case) human-human agreement under study. The application code used in this work is available at https://github.com/rlucas7/code-searcher.
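To illustrate the kind of human-AI agreement analysis described here, the sketch below pairs a hypothetical LLM-as-a-Judge prompt with a simple Cohen's kappa computation between human and LLM relevance labels; the label set, prompt wording, and function names are assumptions, not the paper's protocol.

```python
from collections import Counter

def judge_prompt(query: str, code: str) -> str:
    """Build a hypothetical relevance-judging prompt for an LLM judge."""
    return (
        "You are judging code search results.\n"
        f"Query: {query}\n"
        f"Candidate function:\n{code}\n"
        "Answer with one of: relevant, partially_relevant, not_relevant."
    )

def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two annotators (e.g. human vs. LLM)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Toy usage with made-up labels for five query-function pairs.
human = ["relevant", "not_relevant", "relevant", "partially_relevant", "relevant"]
llm   = ["relevant", "not_relevant", "partially_relevant", "partially_relevant", "relevant"]
print(cohen_kappa(human, llm))
```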
Many evaluations of large language models (LLMs) in text annotation focus primarily on the correctness of the output, typically comparing model-generated labels to human-annotated ''ground truth'' using standard performance metrics. In contrast, our study moves beyond effectiveness alone. We aim to explore how labeling decisions, by both humans and LLMs, can be statistically evaluated across individuals. Rather than treating LLMs purely as annotation systems, we approach them as an alternative annotation mechanism that may be capable of mimicking the subjective judgments made by humans. To assess this, we develop a statistical evaluation method based on Krippendorff's alpha, paired bootstrapping, and the Two One-Sided t-Tests (TOST) equivalence procedure. This method tests whether an LLM can blend into a group of human annotators without being distinguishable. We apply this approach to two datasets, MovieLens 100K and PolitiFact, and find that the LLM is statistically indistinguishable from a human annotator on MovieLens 100K (p = 0.004) but not on PolitiFact (p = 0.155), highlighting task-dependent differences.
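The following is a rough sketch of how such an equivalence check could be assembled, assuming the third-party `krippendorff` package and SciPy; the synthetic data, bootstrap scheme, and equivalence bound are assumptions and do not reproduce the authors' exact procedure.

```python
import numpy as np
import krippendorff            # pip install krippendorff
from scipy import stats

def alpha_nominal(matrix: np.ndarray) -> float:
    """Krippendorff's alpha for an (annotators x items) matrix of nominal labels."""
    return krippendorff.alpha(reliability_data=matrix,
                              level_of_measurement="nominal")

def tost_equivalence(diffs: np.ndarray, bound: float = 0.05) -> float:
    """Two one-sided t-tests: are alpha differences within +/- bound?"""
    p_lower = stats.ttest_1samp(diffs, -bound, alternative="greater").pvalue
    p_upper = stats.ttest_1samp(diffs, bound, alternative="less").pvalue
    return max(p_lower, p_upper)   # combined TOST p-value

rng = np.random.default_rng(0)
humans = rng.integers(1, 6, size=(5, 200)).astype(float)   # 5 humans, 200 items
llm = humans[0] + rng.integers(-1, 2, size=200)            # an LLM-like annotator

# Paired bootstrap over items: compare alpha for the all-human group with
# alpha after swapping one human for the LLM on the same resampled items.
diffs = []
for _ in range(500):
    idx = rng.integers(0, humans.shape[1], humans.shape[1])
    base = alpha_nominal(humans[:, idx])
    swapped = alpha_nominal(np.vstack([llm[idx], humans[1:, idx]]))
    diffs.append(swapped - base)
print("TOST p-value:", tost_equivalence(np.array(diffs)))
```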
Generative Large Language Models (LLMs) like GPT, Gemini, and Llama are transforming Information Retrieval, enabling new and more effective approaches to document retrieval and ranking. The switch from previous-generation pre-trained language model backbones (e.g., BERT, T5) to generative LLM backbones has required the field to adapt its training processes; it has also provided unprecedented capabilities and opportunities, stimulating research into zero-shot approaches, reasoning approaches, reinforcement-learning-based training, and multilingual and multimodal applications. This tutorial will provide a structured overview of LLM-based retrievers and rankers, covering fundamental architectures, training paradigms, real-world deployment considerations, and open challenges and research directions.
The rapid progress of large language models (LLMs) has fundamentally reshaped information retrieval (IR) systems, including search engines and recommender systems, by enabling new capabilities and interaction paradigms. However, the integration of LLMs into IR pipelines also brings pressing challenges to trustworthiness, particularly in the form of bias, unfairness, and hallucination, which can significantly disrupt the information ecosystem. This tutorial provides a comprehensive overview of these challenges and their emerging mitigation strategies. We begin by presenting a unified perspective that frames bias, unfairness, and hallucination as manifestations of distribution mismatch, with mitigation strategies broadly conceptualized under distribution alignment. Building on this framework, we examine how these issues arise across three critical stages of LLM-integrated IR systems: data collection, model development, and result evaluation. For each stage, we systematically review recent findings, characterize distinct types of bias, unfairness, and hallucination, and discuss corresponding mitigation approaches. Finally, we outline open problems and highlight promising research directions that can advance trustworthy IR in the LLM era. By bridging multiple strands of research, this tutorial aims to raise awareness and provide actionable insights for researchers, practitioners, and stakeholders in both the IR and broader AI communities.
Large Language Models (LLMs) have significantly advanced Artificial Intelligence (AI), demonstrating impressive capabilities in language understanding, reasoning, and generation. However, their fixed context windows fundamentally limit their utility in sustained, complex human-computer interactions, leading to issues such as forgetting previous turns, failing to maintain consistent personas, and being unable to perform long-horizon reasoning. While Retrieval-Augmented Generation (RAG) offers a promising solution by externalizing knowledge and providing LLMs with relevant information from external corpora, its traditional static retrieve-then-generate pipeline often struggles with dynamic knowledge integration, introduces noise, and overlooks structural relationships. This tutorial introduces the evolution from traditional RAG to advanced Long-Term Memory (LTM) mechanisms that equip LLM-based conversational agents with human-like memory capabilities. We will explore various LTM architectures, including textual, graph-based, and parametric memory, detailing their forms, operations (such as dynamic indexing, retrieval, updating, and consolidation), and multimodal integration strategies. The tutorial will cover cutting-edge systems (like Mem0), illustrating how they enable agents to maintain coherent conversations, personalize interactions, and perform complex reasoning over extended periods. We will also delve into evaluation benchmarks (e.g., LoCoMo and ZH-4O) as well as metrics that comprehensively assess these long-term memory capabilities. Finally, we will discuss current limitations and promising future research directions, particularly focusing on AI self-evolution, multimodal memory, and ethical considerations. This tutorial aims to provide a comprehensive understanding for researchers and practitioners interested in building the next generation of intelligent, memory-aware conversational AI agents.
Retrieval-Augmented Generation (RAG) has become a foundational paradigm for equipping large language models (LLMs) with external knowledge, playing a critical role in information retrieval and knowledge-intensive applications. However, conventional RAG systems typically adopt a static retrieve-then-generate pipeline and rely on in-context knowledge injection, which can be suboptimal for complex tasks that require multi-hop reasoning, adaptive information access, and deeper integration of external knowledge. Motivated by these limitations, the research community has moved beyond static retrieval and in-context knowledge injection. Among the emerging directions, this tutorial delves into two rapidly growing and complementary research areas on RAG: Dynamic RAG and Parametric RAG. Dynamic RAG adaptively determines when and what to retrieve during the LLM's generation process, enabling real-time adaptation to the LLM's evolving information needs. Parametric RAG rethinks how retrieved knowledge should be injected into LLMs, transitioning from input-level to parameter-level knowledge injection for enhanced efficiency and effectiveness. This tutorial offers a comprehensive overview of recent advances in these two research areas. Building on our SIGIR 2025 tutorial, this edition has been updated to incorporate the latest research and recent developments in the field.
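As a toy illustration of the dynamic-retrieval idea (deciding when to retrieve during generation), the sketch below triggers retrieval when recent token log-probabilities indicate high uncertainty; the window size, threshold, and the `generate`/`retrieve` callables are assumptions, and parametric RAG would instead inject the retrieved knowledge at the parameter level rather than into the prompt.

```python
def should_retrieve(token_logprobs: list[float], threshold: float = 1.5,
                    window: int = 8) -> bool:
    """Trigger retrieval when the mean negative log-probability of the most
    recent tokens exceeds a threshold, i.e. generation looks uncertain."""
    recent = token_logprobs[-window:]
    return bool(recent) and (-sum(recent) / len(recent)) > threshold

def dynamic_rag_step(generate, retrieve, prompt: str) -> str:
    """One decode-monitor-retrieve cycle. `generate(prompt)` returns
    (text, token_logprobs); `retrieve(query)` returns a context string."""
    draft, logprobs = generate(prompt)
    if should_retrieve(logprobs):
        context = retrieve(draft)          # use the partial draft as the query
        draft, _ = generate(f"{context}\n\n{prompt}")
    return draft
```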
We propose a new SIGIR-AP workshop, which we call BREV-RAG (Beyond Relevance-based EValuation of RAG systems). While RAG evaluation primarily considers relevance, correctness, and groundedness, there are likely many other evaluation axes that should be considered for users and downstream tasks. For example, the upcoming NTCIR-19 R2C2 (RAG Responses: Confident and Correct?) task addresses the problem of returning an answer confidence score that aligns with answer accuracy. The first half of the workshop will feature refereed paper presentations whose central theme is evaluating RAG along a novel evaluation axis. The second half will be devoted to breakout discussions, where each group will choose an evaluation axis it deems important for future RAG evaluations and discuss possible and feasible approaches to implementing the corresponding evaluation scheme. As tangible output, we plan to publish the accepted papers in the CEUR workshop proceedings series and to submit a workshop summary to SIGIR Forum (June 2026 issue).
In recent years, large language models (LLMs) have demonstrated remarkable capabilities across a broad spectrum of tasks, spanning language understanding, complex reasoning, and decision-making. However, they still face inherent limitations, such as hallucinations and outdated parametric knowledge. To mitigate these challenges, retrieval-augmented generation (RAG) has emerged as a promising technique and has attracted increasing attention. As RAG continues to be widely applied, a growing number of challenges and limitations have surfaced, underscoring the urgent need for deeper, foundational research to advance and refine current RAG frameworks. We therefore propose to organize R³AG 2025, the second workshop on Refined and Reliable Retrieval-Augmented Generation, at SIGIR-AP 2025. This workshop seeks to bring together researchers and practitioners to re-examine and re-establish the core principles and practical implementations of refined and reliable RAG. It will serve as a collaborative platform for academia and industry to exchange insights and discuss foundational issues and recent advances. By the end of the workshop, we aim to arrive at a clearer understanding of promising directions for enhancing the reliability and applicability of RAG.