SIGIR-AP '23: Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region


SESSION: Session 1: Conversational Search

A Comparative Study of Training Objectives for Clarification Facet Generation

Because user queries are often ambiguous or vague, it is essential to identify query facets to clarify user intent. Existing work on query facet generation has achieved compelling performance by sequentially predicting the next facet given previously generated facets, based on pre-trained language generation models such as BART. Given a query, there are two main types of training objectives for guiding facet generation models. One is to generate the ground-truth facets in their default order; the other is to enumerate all permutations of the ground-truth facets and use the sequence with the minimum loss for model updates. The second is permutation-invariant while the first is not. In this paper, we conduct a systematic comparative study of training objectives that differ in three properties: whether they are permutation-invariant, whether they predict facets sequentially, and whether they can control the number of output facets. To this end, we propose three additional training objectives with different combinations of these properties. For comprehensive comparisons, besides the commonly used evaluation that measures matching with ground-truth facets, we also introduce two diversity metrics to measure the diversity of the generated facets. Based on an open-domain query facet dataset, i.e., MIMICS, we conduct extensive analyses and show the pros and cons of each method, which could shed light on model training for clarification facet generation. The code can be found at https://github.com/ShiyuNee/Facet-Generation.
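
As a rough illustration of the distinction drawn above, the sketch below contrasts the default-order objective with the permutation-invariant one using a toy stand-in for the sequence loss; the real objectives are computed over BART token-level losses, and the facet names here are invented.

```python
from itertools import permutations

def sequence_loss(predicted_facets, target_facets):
    """Stand-in for a seq2seq loss (e.g., token-level cross-entropy).
    Here: a toy mismatch count, just to make the sketch runnable."""
    return sum(p != t for p, t in zip(predicted_facets, target_facets)) + abs(
        len(predicted_facets) - len(target_facets)
    )

def default_order_loss(predicted_facets, gold_facets):
    # Objective 1: compare against the ground-truth facets in their given order.
    return sequence_loss(predicted_facets, gold_facets)

def permutation_invariant_loss(predicted_facets, gold_facets):
    # Objective 2: enumerate all orderings of the gold facets and use the
    # one with the minimum loss to update the model.
    return min(
        sequence_loss(predicted_facets, list(perm))
        for perm in permutations(gold_facets)
    )

gold = ["price", "reviews", "warranty"]
pred = ["reviews", "price", "warranty"]
print(default_order_loss(pred, gold))          # 2: order differs, so it is penalised
print(permutation_invariant_loss(pred, gold))  # 0: some permutation matches exactly
```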

Retrieving Supporting Evidence for Generative Question Answering

Current large language models (LLMs) can exhibit near-human levels of performance on many natural language-based tasks, including open-domain question answering. Unfortunately, at this time, they also convincingly hallucinate incorrect answers, so responses to questions must be verified against external sources before they can be accepted at face value. In this paper, we report two simple experiments to automatically validate generated answers against a corpus. We base our experiments on questions and passages from the MS MARCO (V1) test collection, and a retrieval pipeline consisting of sparse retrieval, dense retrieval, and neural rerankers. In the first experiment, we validate the generated answer in its entirety. After presenting a question to an LLM and receiving a generated answer, we query the corpus with the combination of the question + generated answer. We then present the LLM with the combination of the question + generated answer + retrieved answer, prompting it to indicate whether the generated answer can be supported by the retrieved answer. In the second experiment, we consider the generated answer at a more granular level, prompting the LLM to extract a list of factual statements from the answer and verifying each statement separately. We query the corpus with each factual statement and then present the LLM with the statement and the corresponding retrieved evidence. The LLM is prompted to indicate whether the statement can be supported and to make necessary edits using the retrieved material. With an accuracy of over 80%, we find that an LLM is capable of verifying its generated answer when a corpus of supporting material is provided. However, manual assessment of a random sample of questions reveals that incorrect generated answers are missed by this verification process. While this verification process can reduce hallucinations, it cannot entirely eliminate them.
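
A minimal sketch of the whole-answer verification loop described in the first experiment; `ask_llm` and `search_corpus` are hypothetical stand-ins for the LLM and the sparse/dense/reranking pipeline, and the prompt wording is illustrative rather than the one used in the paper.

```python
def verify_answer(question, ask_llm, search_corpus):
    """Sketch: generate an answer, retrieve evidence with question + answer,
    then ask the model whether the retrieved passage supports the answer.

    `ask_llm(prompt) -> str` and `search_corpus(query) -> str` are assumed
    callables standing in for the LLM and the retrieval pipeline."""
    generated = ask_llm(f"Answer the question: {question}")
    evidence = search_corpus(f"{question} {generated}")
    verdict = ask_llm(
        "Question: {q}\nGenerated answer: {a}\nRetrieved passage: {e}\n"
        "Does the retrieved passage support the generated answer? "
        "Reply SUPPORTED or NOT SUPPORTED.".format(q=question, a=generated, e=evidence)
    )
    return generated, evidence, verdict
```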

Recommending Answers to Math Questions Based on KL-Divergence and Approximate XML Tree Matching

Math is the science and study of quantity, structure, space, and change. It seeks out patterns, formulates new conjectures, and establishes truth by rigorous deduction from appropriately chosen axioms and definitions. The study of math makes a person better at solving problems. It gives a person skills that they can use across other subjects and apply in different job roles. In the modern world, builders use math every day to do their work, since construction workers add, subtract, divide, multiply, and work with fractions. It is obvious that math is a major contributor to many areas of study. For this reason, math information retrieval (Math IR) deserves attention and recognition, since a reliable Math IR system helps users find relevant answers to math questions and benefits all math learners whenever they need help solving a math problem, regardless of time and place. Moreover, Math IR systems enhance the learning experience of their users. In this paper, we present MaRec, a recommender system that retrieves and ranks math answers based on their textual content and embedded formulas in answering a math question. MaRec ranks a potential answer A given a math question Q by computing (i) the KL-divergence score between A and Q using their textual content, and (ii) the subtree matching score of the math formulas in Q and A, represented as XML trees. The design of MaRec is simple and easy to understand, since it relies solely on a probability model and an elegant tree-matching approach to rank math answers. Empirical studies show that MaRec significantly outperforms (i) three existing state-of-the-art Math IR systems based on an offline evaluation, and (ii) two top-of-the-line machine learning systems based on an online analysis.
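
The abstract does not spell out the exact KL-divergence formulation, so the sketch below shows one conventional way to score an answer against a question with Dirichlet-smoothed unigram language models (negative KL divergence, higher is better); MaRec's actual formulation may differ, and the subtree-matching component over XML formula trees is not sketched.

```python
import math
from collections import Counter

def kl_divergence_score(question_tokens, answer_tokens, mu=2000, collection_counts=None):
    """Negative KL divergence between the question language model and a
    Dirichlet-smoothed answer language model: higher means a better match.
    A minimal sketch under assumed smoothing choices."""
    collection_counts = collection_counts or Counter(question_tokens + answer_tokens)
    coll_total = sum(collection_counts.values())
    q_counts, a_counts = Counter(question_tokens), Counter(answer_tokens)
    q_total, a_total = len(question_tokens), len(answer_tokens)
    score = 0.0
    for term, q_tf in q_counts.items():
        p_q = q_tf / q_total
        p_coll = collection_counts[term] / coll_total
        p_a = (a_counts[term] + mu * p_coll) / (a_total + mu)  # Dirichlet smoothing
        score -= p_q * math.log(p_q / p_a)                     # -KL(Q || A)
    return score

print(kl_divergence_score("solve quadratic equation".split(),
                          "use the quadratic formula to solve the equation".split()))
```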

EALM: Introducing Multidimensional Ethical Alignment in Conversational Information Retrieval

Artificial intelligence (AI) technologies should adhere to human norms to better serve our society and avoid disseminating harmful or misleading information, particularly in Conversational Information Retrieval (CIR). Previous work, including approaches and datasets, has not always been successful or sufficiently robust in taking human norms into consideration. To this end, we introduce a workflow that integrates ethical alignment, with an initial ethical judgment stage for efficient data screening. To address the need for ethical judgment in CIR, we present the QA-ETHICS dataset, adapted from the ETHICS benchmark, which serves as an evaluation tool by unifying scenarios and label meanings. However, each of its scenarios only considers one ethical concept. Therefore, we introduce the MP-ETHICS dataset to evaluate a scenario under multiple ethical concepts, such as justice and deontology. In addition, we suggest a new approach that achieves top performance in both binary and multi-label ethical judgment tasks. Our research provides a practical method for introducing ethical alignment into the CIR workflow. The data and code are available at https://github.com/wanng-ide/ealm.

Open-Domain Dialogue Quality Evaluation: Deriving Nugget-level Scores from Turn-level Scores

Existing dialogue quality evaluation systems can return a score for a given system turn from a particular viewpoint, e.g., engagingness. However, to improve dialogue systems by locating exactly where in a system turn potential problems lie, a more fine-grained evaluation may be necessary. We therefore propose an evaluation approach where a turn is decomposed into nuggets (i.e., expressions associated with a dialogue act), and nugget-level evaluation is enabled by leveraging an existing turn-level evaluation system. We demonstrate the potential effectiveness of our evaluation method through a case study.

ChatGPT Hallucinates when Attributing Answers

Can ChatGPT provide evidence to support its answers? Does the evidence it suggests actually exist and does it really support the answer? We investigate these questions using a collection of domain-specific knowledge-based questions, specifically prompting ChatGPT to provide both an answer and supporting evidence in the form of references to external sources. We also investigate how different prompts impact answers and evidence.

We find that ChatGPT can provide correct or partially correct answers; however, the suggested references actually exist only 14% of the time. We further analyse the generated references, revealing common traits among them, and show that even when a reference provided by the model does exist, it often does not support the claims ChatGPT attributes to it.

Our findings are important because (1) they are the first systematic analysis of the references generated by ChatGPT in its answers; (2) they suggest that the model may leverage good quality information in producing correct answers, but is unable to attribute real evidence to support its answers. Prompts, raw result files and manual analysis are made publicly available at https://github.com/ielab/LLM-attribution.

SESSION: Session 2: Domains and Applications

Multimodal Fashion Knowledge Extraction as Captioning

Social media plays a significant role in boosting the fashion industry, where a massive number of fashion-related posts are generated every day. To obtain the rich fashion information in these posts, we study the task of social media fashion knowledge extraction. Fashion knowledge, which typically consists of the occasion, person attributes, and fashion item information, can be effectively represented as a set of tuples. Most previous studies on fashion knowledge extraction are based on fashion product images and do not consider the rich textual information in social media posts. Existing work on fashion knowledge extraction in social media is classification-based and requires manually determining a set of fashion knowledge categories in advance. In our work, we propose to cast the task as a captioning problem to capture the interplay of the multimodal post information. Specifically, we transform the fashion knowledge tuples into a natural language caption with a sentence transformation method. Our framework then aims to generate the sentence-based fashion knowledge directly from the social media post. Inspired by the success of pre-trained models, we build our model on a multimodal pre-trained generative model and design several auxiliary tasks to enhance the knowledge extraction. Since no existing dataset can be directly adopted for our task, we introduce a dataset consisting of social media posts with manual fashion knowledge annotation. Extensive experiments demonstrate the effectiveness of our model.
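
To make the tuple-to-caption idea concrete, here is a toy illustration of turning a fashion knowledge tuple into a sentence; the template and field names are invented for illustration and are not the paper's actual sentence transformation method.

```python
def fashion_tuple_to_caption(occasion, person_attrs, items):
    """Turn a fashion knowledge tuple (occasion, person attributes, items)
    into a natural-language caption so extraction can be cast as caption
    generation. The template below is an illustrative guess."""
    person = " and ".join(person_attrs) if person_attrs else "a person"
    outfit = ", ".join(items)
    return f"For {occasion}, {person} wears {outfit}."

print(fashion_tuple_to_caption("a wedding", ["a young woman"], ["a floral dress", "heels"]))
# -> "For a wedding, a young woman wears a floral dress, heels."
```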

Chuweb21D: A Deduped English Document Collection for Web Search Tasks

As a traditional information retrieval task, ad hoc web search has long been an important part of IR research and evaluation tracks (e.g., TREC, NTCIR and CLEF). A crawled, large-scale web document collection is a central component of offline web search evaluation. Although several English document collections already exist, such as the ClueWeb series, GOV2 and C4, a collection that offers both strong timeliness and raw HTML formatting remains relatively scarce.

To better support the demands of nascent web search tasks, we have built and publicly released Chuweb21D, a large-scale deduped English document collection for web search tasks. The Chuweb21D collection is derived from Chuweb21, which we released in April 2021 as the target corpus for the NTCIR-16 WWW-4 Task. We applied two different deduping thresholds to obtain two versions of Chuweb21D, called Chuweb21D-60 and Chuweb21D-70; the former is used as the target corpus for the ongoing NTCIR-17 FairWeb-1 task. To gain insight into the impact of deduping, we evaluate the runs submitted to the NTCIR-16 WWW-4 task using Chuweb21D and compare the outcome with the official results that used the corpus before deduping.
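
The abstract does not describe the deduping procedure or what its 60/70 thresholds measure; purely as an illustration of threshold-based near-duplicate removal, the sketch below filters documents whose word-shingle Jaccard similarity to an already-kept document exceeds a configurable threshold.

```python
def shingles(text, k=5):
    """Set of k-word shingles for a document."""
    tokens = text.split()
    return {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}

def dedupe(documents, threshold=0.6):
    """Keep a document only if its shingle-set Jaccard similarity to every
    previously kept document stays below `threshold`. Purely illustrative;
    not Chuweb21D's actual deduping procedure."""
    kept, kept_shingles = [], []
    for doc in documents:
        s = shingles(doc)
        if all(len(s & t) / len(s | t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept
```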

Generating Natural Language Queries for More Effective Systematic Review Screening Prioritisation

Screening prioritisation in medical systematic reviews aims to rank the set of documents retrieved by complex Boolean queries. Prioritising the most important documents ensures that subsequent review steps can be carried out more efficiently and effectively. The current state of the art uses the final title of the review as a query to rank the documents using BERT-based neural rankers. However, the final title is only formulated at the end of the review process, which makes this approach impractical as it relies on ex post facto information. At the time of screening, only a rough working title is available, with which the BERT-based ranker performs significantly worse than with the final title. In this paper, we explore alternative sources of queries for prioritising screening, such as the Boolean query used to retrieve the documents to be screened and queries generated by instruction-based generative large language models such as ChatGPT and Alpaca. Our best approach is not only viable based on the information available at the time of screening, but also achieves effectiveness similar to using the final title.

Examination of Information Problem Decomposition Strategies: A New Perspective for Understanding Users' Information Problems in Search as Learning

When searching for information in search as learning (SAL), users with an anomalous state of knowledge often have difficulty clearly articulating their information needs. However, it might be easier for them to express their information problems rather than their information needs. In this study, we proposed the concept of “information problem decomposition strategies (IPDS)” to help searchers explicitly express their information problems (IPs) and to facilitate their information problem solving (IPS) process. To uncover users’ dynamic cognitive process of information problem decomposition during search, we conducted an experiment and collected searchers’ sub-information problems through pre-search and post-search questionnaires, experiment recordings, and post-experiment interviews. Furthermore, we proposed a new measure of learning performance, the information problem solving degree, to evaluate how well each sub-information problem was solved during search, in order to assess search performance from a process perspective.

Through the experiment, we successfully collected searchers’ decomposed sub-information problems and found that the number of their queries was significantly positively correlated with the number of decomposed sub-information problems. Thus, it is possible to reveal the dynamic cognitive process within users’ information search by using sub-information problems. We also summarized four typical categories of decomposed sub-information problems, namely the “What”, “How”, “Why” and “When” types, and identified specific decomposition pathways within each category. We found that current search systems are poor at supporting sub-information problems at higher levels of cognitive complexity in SAL. However, some of the five identified information problem decomposition strategies (IPDSs) within the clusters of decomposed sub-information problems could help increase the IPS degree and shed light on term recommendation and directly returned answers in search systems.

ReviVal: Towards Automatically Evaluating the Informativeness of Peer Reviews

The peer-review process is currently under stress due to the increasingly large number of submissions to top-tier venues, especially in Artificial Intelligence (AI) and Machine Learning (ML). Consequently, the quality of peer reviews is under question, and dissatisfaction among authors is not uncommon but rather prominent. In this work, we propose "ReviVal" (expanded as "REVIew eVALuation"), a system to automatically grade a peer-review report for its informativeness. We define review informativeness in terms of its Exhaustiveness and Strength, where Exhaustiveness signifies how exhaustively the review covers the different sections and qualitative aspects of the paper, and Strength signifies how sure the reviewer is of their evaluation. We train ReviVal, a multitask deep network for review informativeness prediction, on publicly available peer reviews, which we curate from the OpenReview platform. We annotate review sentences with labels for (a) the sections and (b) the quality aspects of the paper to which they refer. We automatically annotate our data with the reviewer’s sentiment intensity to capture the reviewer’s conviction. Our approach significantly outperforms several intuitive baselines for this novel task. To the best of our knowledge, our work is the first of its kind to automatically estimate the informativeness of a peer-review report.

User-Meal Interaction Learning for Meal Recommendation: A Reproducibility Study

Recommender systems are important in Web 3.0 as a technology for achieving high-quality interaction between people and the web. Meal recommendation, an application of bundle recommendation, aims to provide a user with courses from specific categories (e.g., appetizer, main dish) that are enjoyed together as a meal. Common methods include collaborative filtering, attention-based, and graph-based approaches. User-meal interaction learning is the core of meal recommendation, and its key feature is the category constraint. However, no prior work has compared and studied these different types of methods. Moreover, existing work does not make full use of category information in meal recommendation. In this paper, we conduct a reproducibility study of user-meal interaction learning for meal recommendation. We reproduce seven state-of-the-art meal and general bundle recommendation models and re-implement two models to take category information into account. Extensive experiments are conducted on two datasets with different user-meal interaction densities to explore the impact of data density on different types of user-meal interaction learning, and to investigate the effectiveness of different category-wise implementations. The experimental results are instructive and beneficial to the development and application of meal recommendation research.

SESSION: Session 3: Learning and Ranking

RFR: Representation-Focused Replay for Overcoming the Catastrophic Forgetting in Lifelong Language Learning

Replay-based approaches can be combined with regularization-based approaches by introducing additional regularization terms or optimization constraints to alleviate catastrophic forgetting in lifelong language learning. A typical approach penalizes changes in the mapping function of the neural network. However, this constraint restricts the parameters of the whole network when the model learns new tasks: the solution space must stay close to the original one, which limits the model's ability to learn from new data. To address this issue, we propose a novel approach called Representation-Focused Replay (RFR) for lifelong language learning. RFR prevents changes to the representations of replay examples while training on new data by using the differences in representations as the optimization constraint. Extensive experiments conducted on text classification benchmarks demonstrate the effectiveness of our proposed method. The experimental results show that RFR achieves an average accuracy of 78.2. Compared with state-of-the-art baselines, RFR achieves higher accuracy on some task sequences and is close to the upper bound of the multi-task learning method.
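
A minimal sketch of what a representation-focused constraint might look like in a PyTorch-style training step, assuming the model exposes separate `encode` and `classify` stages; the interface names, the MSE drift penalty, and the weighting are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def rfr_loss(model, new_batch, replay_batch, stored_reprs, lam=1.0):
    """Sketch of a representation-focused replay objective: the usual task
    loss on new data plus a penalty on how far the representations of replayed
    examples have drifted from the representations stored when those examples
    were first learned."""
    # Task loss on the new task's data.
    logits = model.classify(model.encode(new_batch["inputs"]))
    task_loss = F.cross_entropy(logits, new_batch["labels"])

    # Constrain only the representations of replayed examples, not all parameters.
    current_reprs = model.encode(replay_batch["inputs"])
    drift_penalty = F.mse_loss(current_reprs, stored_reprs)

    return task_loss + lam * drift_penalty
```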

Towards Sequential Counterfactual Learning to Rank

Counterfactual evaluation plays a crucial role in learning-to-rank problems, as it addresses the discrepancy between the data logging policy and the policy being evaluated, due to the presence of presentation bias. Existing counterfactual methods, which are based on the empirical risk minimization framework, aim to evaluate the ability of a ranking policy to produce optimal results for a single query using implicit feedback from logged data. In real-world scenarios, however, users often reformulate their queries multiple times until they find what they are looking for and then provide a feedback signal. In such circumstances, current counterfactual approaches cannot assess a policy’s effectiveness in delivering satisfactory results to the user over consecutive queries during a search session. Taking sequential search behavior into account, we propose the first counterfactual estimator for session ranking metrics under sequential position-based models and conduct preliminary experiments to shed light on further research in this direction.
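
For context, the sketch below shows the standard single-query inverse-propensity-scored estimate of a ranking metric (here DCG) under a position-based examination model; this is the kind of estimator the paper generalizes to sequences of reformulated queries within a session, and the propensity function in the example is a toy assumption.

```python
import math

def ips_dcg_estimate(logged_queries, propensity):
    """Standard single-query IPS estimate: each logged click at rank r
    contributes its DCG gain divided by the examination probability
    P(examined | rank r) of a position-based model. The paper's session-level
    estimator over consecutive queries is not reproduced here."""
    estimates = []
    for ranked_clicks in logged_queries:   # one list of (rank, clicked) pairs per logged query
        est = sum(
            (1.0 / math.log2(rank + 1)) / propensity(rank)
            for rank, clicked in ranked_clicks if clicked
        )
        estimates.append(est)
    return sum(estimates) / max(len(estimates), 1)

# Toy position-based propensities: examination probability 1/rank.
print(ips_dcg_estimate([[(1, True), (2, False), (3, True)]],
                       propensity=lambda r: 1.0 / r))
```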

Unbiased Top-k Learning to Rank with Causal Likelihood Decomposition

Unbiased learning to rank methods have been proposed to address biases in search ranking. These biases, known as position bias and sample selection bias, often occur simultaneously in real applications. Existing approaches either tackle these biases separately or treat them as identical, leading to incomplete elimination of both. This paper employs a causal graph approach to investigate the mechanisms of and interplay between position bias and sample selection bias. The analysis reveals that position bias is a common confounder bias, while sample selection bias falls under the category of collider bias. Together, these biases introduce a cascading process that leads to biased clicks. Based on our analysis, we propose Causal Likelihood Decomposition (CLD), a unified method that effectively mitigates both biases in top-k learning to rank. CLD removes position bias by leveraging propensity scores and then decomposes the likelihood of the selection-biased data into a sample selection bias term and a relevance term. By maximizing the overall log-likelihood function, we obtain an unbiased ranking model from the relevance term. We also extend CLD to pairwise neural ranking. Extensive experiments demonstrate that CLD and its pairwise neural extension outperform baseline methods by effectively mitigating both position bias and sample selection bias. The robustness of CLD is further validated through empirical studies considering variations in bias severity and click noise.

Annotating Data for Fine-Tuning a Neural Ranker? Current Active Learning Strategies are not Better than Random Selection

Search methods based on Pretrained Language Models (PLMs) have demonstrated great effectiveness gains compared to statistical and early neural ranking models. However, fine-tuning PLM-based rankers requires a great amount of annotated training data. Annotating data involves a large manual effort and thus is expensive, especially in domain-specific tasks. In this paper we investigate fine-tuning PLM-based rankers under limited training data and budget. We investigate two scenarios: fine-tuning a ranker from scratch, and domain adaptation, i.e., starting with a ranker already fine-tuned on general data and continuing fine-tuning on a target dataset.

We observe great variability in effectiveness when fine-tuning on different randomly selected subsets of training data. This suggests that it is possible to achieve effectiveness gains by actively selecting a subset of the training data that has the most positive effect on the ranker. This way, it would be possible to fine-tune effective PLM rankers at a reduced annotation budget. To investigate this, we adapt existing Active Learning (AL) strategies to the task of fine-tuning PLM rankers and investigate their effectiveness, also considering annotation and computational costs. Our extensive analysis shows that AL strategies do not significantly outperform random selection of training subsets in terms of effectiveness. We further find that the gains provided by AL strategies come at the expense of more assessments (and thus higher annotation costs), and that AL strategies underperform random selection when effectiveness is compared at a fixed annotation cost. Our results highlight that “optimal” subsets of training data that provide high effectiveness at low annotation cost do exist, but current mainstream AL strategies applied to PLM rankers are not capable of identifying them.

Enhancing Sparse Retrieval via Unsupervised Learning

Recent work has shown that neural retrieval models excel at text ranking tasks in a supervised setting when given large amounts of manually labeled training data. However, it remains an open question how to train unsupervised retrieval models that are more effective than baselines such as BM25. While some progress has been made in unsupervised dense retrieval models within a bi-encoder architecture, unsupervised sparse retrieval models remain unexplored. We propose BM26, to our knowledge the first such model, which is trained in an unsupervised manner without the need for any human relevance judgments. Evaluations with multiple test collections show that BM26 performs on par with BM25 and outperforms Contriever, the current state-of-the-art unsupervised dense retriever. We further demonstrate two promising avenues to enhance lexical retrieval: First, we can combine BM25 and BM26 using simple vector concatenation to yield an unsupervised hybrid BM51 model that significantly improves over BM25 alone. Second, we can enhance supervised sparse models such as SPLADE with improved initialization using BM26, yielding significant improvements in in-domain and zero-shot retrieval effectiveness.
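
As an aside on the "simple vector concatenation" used to form the hybrid model, the sketch below shows one way two sparse term-weight vectors can be concatenated by namespacing their keys so that a single dot product adds the two scores; how BM26 itself assigns weights is defined in the paper, and the weights here are made up.

```python
def concat_sparse(bm25_vec, bm26_vec):
    """Concatenate two sparse term-weight vectors by namespacing their keys.
    Scoring the concatenated query and document vectors with a dot product is
    then equivalent to summing the BM25 and BM26 scores."""
    hybrid = {("bm25", term): w for term, w in bm25_vec.items()}
    hybrid.update({("bm26", term): w for term, w in bm26_vec.items()})
    return hybrid

def dot(query_vec, doc_vec):
    return sum(w * doc_vec.get(key, 0.0) for key, w in query_vec.items())

q = concat_sparse({"neural": 1.2, "retrieval": 0.8}, {"neural": 0.9, "ranking": 0.4})
d = concat_sparse({"neural": 2.0, "retrieval": 1.0}, {"neural": 1.5, "ranking": 0.7})
print(dot(q, d))  # hybrid score = BM25 dot product + BM26 dot product
```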

SESSION: Session 4: Legal Search and Indexing

Result Diversification for Legal Case Retrieval

Legal case retrieval has received considerable attention in the last decade. As more and more legal documents are collected and stored in digital form, the need for efficient and reliable access to relevant information in large-scale legal databases continues to grow. While most existing studies have focused on differentiating relevant documents from irrelevant ones based on their similarity to the query case, user studies have revealed that similarity is not the sole concern in legal case retrieval. In many instances, users require not only cases that are similar in content but also cases that encompass a broad range of subtopics (i.e., charges) related to the query case. In contrast to open-domain retrieval, such as web search, our research has found that search diversification in legal case retrieval involves a smaller number of highly correlated subtopics. To address this issue, we have constructed a Diversity Legal Retrieval dataset (DLR-dataset) that includes query-charge labels and charge-level relevance labels between the query case and candidate cases. Additionally, we propose a Diversified Legal Case Retrieval Model (DLRM) that simultaneously considers topical relevance and subtopic relationships using a legal relationship graph. Experimental results demonstrate that DLRM outperforms existing diversified search baselines in the field of legal retrieval.

Investigating the Influence of Legal Case Retrieval Systems on Users' Decision Process

Given a specific query case, legal case retrieval systems aim to retrieve a set of case documents relevant to the case at hand. Previous studies on user behavior analysis have shown that information retrieval (IR) systems can significantly influence users’ decisions by presenting results in varying orders and formats. However, whether such influence exists in legal case retrieval remains largely unknown. This study presents the first investigation into the influence of legal case retrieval systems on the decision-making process of legal users. We conducted an online user study involving more than ninety participants, and our findings suggest that the result distribution of legal case retrieval systems indeed affects users’ judgements on the sentences in cases. Notably, when users are presented with biased results that involve harsher sentences, they tend to impose harsher sentences on the current case as well. This research highlights the importance of optimizing the unbiasedness of legal case retrieval systems.

Boosting legal case retrieval by query content selection with large language models

Legal case retrieval, which aims to retrieve cases relevant to a given query case, benefits judicial justice and is attracting increasing attention. Unlike generic retrieval queries, legal case queries are typically long, and the definition of relevance is closely related to legal-specific elements. Therefore, legal case queries may suffer from noise and sparsity of salient content, which hinders retrieval models from perceiving the correct information in a query. While previous studies have paid attention to improving retrieval models and understanding relevance judgments, we focus on enhancing legal case retrieval by utilizing the salient content in legal case queries. We first annotate the salient content in queries manually and investigate how sparse and dense retrieval models attend to that content. Then we experiment with various query content selection methods utilizing large language models (LLMs) to extract or summarize salient content and incorporate it into the retrieval models. Experimental results show that reformulating long queries using LLMs improves the performance of both sparse and dense models in legal case retrieval.

Lossy Compression Options for Dense Index Retention

Dense indexes derived from whole-of-document neural models are now more effective at locating likely-relevant documents than are conventional term-based inverted indexes. That effectiveness comes at a cost, however: inverted indexes require less than a byte per posting to store, whereas dense indexes store a fixed-length vector of floating point coefficients (typically 768) for each document, making them potentially an order of magnitude larger. In this paper we consider compression of indexes employing dense vectors. Only limited space savings can be achieved via lossless compression techniques, but we demonstrate that dense indexes are responsive to lossy techniques that sacrifice controlled amounts of numeric resolution in order to gain compressibility. We describe suitable schemes, and, via experiments on three different collections, show that substantial space savings can be achieved with minimal loss of ranking fidelity. These techniques further boost the attractiveness of dense indexes for practical use.
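
The abstract mentions sacrificing controlled amounts of numeric resolution; as a generic illustration (not necessarily one of the schemes evaluated in the paper), the sketch below applies per-dimension linear scalar quantization to a matrix of 768-dimensional float32 document vectors, storing 8-bit codes plus per-dimension offsets and scales.

```python
import numpy as np

def quantize(vectors, bits=8):
    """Per-dimension linear scalar quantization of a dense index:
    float32 coefficients -> unsigned 8-bit codes, trading numeric resolution
    for roughly a 4x space reduction. One simple lossy option, shown for
    illustration only."""
    lo, hi = vectors.min(axis=0), vectors.max(axis=0)
    scale = (hi - lo) / (2 ** bits - 1)
    scale[scale == 0] = 1.0                        # avoid division by zero on constant dims
    codes = np.round((vectors - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return codes.astype(np.float32) * scale + lo

index = np.random.randn(1000, 768).astype(np.float32)   # e.g., 768-d document vectors
codes, lo, scale = quantize(index)
approx = dequantize(codes, lo, scale)
print(np.abs(index - approx).max())                      # small reconstruction error
```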

How to Index Item IDs for Recommendation Foundation Models

Recommendation foundation models utilize large language models (LLMs) for recommendation by converting recommendation tasks into natural language tasks. They enable generative recommendation, which directly generates the item(s) to recommend rather than calculating a ranking score for every candidate item as in traditional recommendation models, simplifying the recommendation pipeline from multi-stage to single-stage filtering. To avoid generating excessively long text and hallucinated recommendations when deciding which item(s) to recommend, creating LLM-compatible item IDs to uniquely identify each item is essential for recommendation foundation models. In this study, we systematically examine the item ID creation and indexing problem for recommendation foundation models, using P5 as an example of the backbone LLM. To emphasize the importance of item indexing, we first discuss the issues of several trivial item indexing methods, such as random indexing, title indexing, and independent indexing. We then propose four simple yet effective solutions, including sequential indexing, collaborative indexing, semantic (content-based) indexing, and hybrid indexing. Our study highlights the significant influence of item indexing methods on the performance of LLM-based recommendation, and our results on real-world datasets validate the effectiveness of our proposed solutions. The research also demonstrates how recent advances in language modeling and traditional IR principles such as indexing can help each other for better learning and inference. Source code and data are available at https://github.com/Wenyueh/LLM-RecSys-ID.
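
As a rough illustration of the simplest of these schemes, sequential indexing, the sketch below assigns consecutive integer IDs in the order items first appear when scanning user interaction sequences, so items that co-occur tend to receive numerically close IDs; the collaborative, semantic, and hybrid schemes in the paper are more involved, and this is not the paper's implementation.

```python
def sequential_index(user_sequences):
    """Assign item IDs consecutively in the order items first appear while
    scanning user interaction sequences. Items that co-occur in the same
    sequences tend to get nearby IDs (and thus share token prefixes once
    the integers are tokenized). A minimal sketch of the idea only."""
    item_to_id, next_id = {}, 1
    for seq in user_sequences:
        for item in seq:
            if item not in item_to_id:
                item_to_id[item] = next_id
                next_id += 1
    return item_to_id

print(sequential_index([["shoes", "socks", "hat"], ["socks", "scarf"]]))
# {'shoes': 1, 'socks': 2, 'hat': 3, 'scarf': 4}
```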

SESSION: Session 5: Neural Search and Fairness

Adaptive Learning on User Segmentation: Universal to Specific Representation via Bipartite Neural Interaction

Recently, models for user representation learning have been widely applied in click-through-rate (CTR) and conversion-rate (CVR) prediction. Usually, the model learns a universal user representation as the input for subsequent scenario-specific models. However, in numerous industrial applications (e.g., recommendation and marketing), businesses often run such applications as online activities targeting different user segments, which are typically created by domain experts. Due to differences in user distribution (i.e., user segments) and business objectives across subsequent tasks, learning solely from the universal representation may harm both model performance and robustness. In this paper, we propose a novel learning framework that first learns a general universal user representation through an information bottleneck, and then merges and learns a segment-specific or task-specific representation through neural interaction. We design the interactive learning process by leveraging a bipartite graph architecture to model the representation learning and merging between contextual clusters and each user segment. Our proposed method is evaluated on two open-source benchmarks and two offline business datasets, and is deployed in two online marketing applications to predict users’ CVR. The results demonstrate that our method achieves superior performance and surpasses the baseline methods.

Typos-aware Bottlenecked Pre-Training for Robust Dense Retrieval

Current dense retrievers (DRs) are limited in their ability to effectively process misspelled queries, which constitute a significant portion of query traffic in commercial search engines. The main issue is that the pre-trained language model-based encoders used by DRs are typically trained and fine-tuned using clean, well-curated text data. Misspelled queries are typically not found in the data used for training these models, and thus misspelled queries observed at inference time are out-of-distribution compared to the data used for training and fine-tuning. Previous efforts to address this issue have focused on fine-tuning strategies, but their effectiveness on misspelled queries remains lower than that of pipelines that employ separate state-of-the-art spell-checking components.

To address this challenge, we propose ToRoDer (TypOs-aware bottlenecked pre-training for RObust DEnse Retrieval), a novel pre-training strategy for DRs that increases their robustness to misspelled queries while preserving their effectiveness in downstream retrieval tasks. ToRoDer utilizes an encoder-decoder architecture where the encoder takes misspelled text with masked tokens as input and outputs bottlenecked information to the decoder. The decoder then takes as input the bottlenecked embeddings, along with token embeddings of the original text with the misspelled tokens masked out. The pre-training task is to recover the masked tokens for both the encoder and decoder.
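
One piece of such a pre-training pipeline that is easy to illustrate is synthesizing misspelled inputs from clean text; the character-edit function below is an assumed, illustrative corruption procedure, not necessarily the one used to train ToRoDer, and the masking and bottlenecking steps are not shown.

```python
import random

def add_typos(text, typo_rate=0.15, seed=None):
    """Generate a misspelled variant of clean text by applying random
    character-level edits (swap, drop, duplicate) to a fraction of words.
    One plausible way to synthesize the misspelled inputs the encoder sees
    during pre-training; the paper's exact procedure may differ."""
    rng = random.Random(seed)
    words = []
    for word in text.split():
        if len(word) > 3 and rng.random() < typo_rate:
            i = rng.randrange(len(word) - 1)
            op = rng.choice(["swap", "drop", "dup"])
            if op == "swap":
                word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
            elif op == "drop":
                word = word[:i] + word[i + 1:]
            else:
                word = word[:i] + word[i] + word[i:]
        words.append(word)
    return " ".join(words)

print(add_typos("how to train a dense retriever robust to misspelled queries", seed=42))
```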

Our extensive experimental results and detailed ablation studies show that DRs pre-trained with ToRoDer exhibit significantly higher effectiveness on misspelled queries, substantially closing the gap with pipelines that use a separate, complex spell-checker component, while retaining their effectiveness on correctly spelled queries.

Selecting which Dense Retriever to use for Zero-Shot Search

We propose the new problem of choosing which dense retrieval model to use when searching a new collection for which no labels are available, i.e., in a zero-shot setting. Many dense retrieval models are readily available. Each model, however, exhibits very different search effectiveness – not just on the test portion of the dataset on which its dense representations were learned but, importantly, also across datasets whose data was not used to learn those representations. This is because dense retrievers typically require training on a large amount of labeled data to achieve satisfactory search effectiveness in a specific dataset or domain. Moreover, effectiveness gains obtained by dense retrievers on datasets for which they observe labels during training do not necessarily generalise to datasets not observed during training.

This is, however, a hard problem: through empirical experimentation, we show that methods inspired by recent work on unsupervised performance evaluation under domain shift in computer vision and machine learning are not effective for choosing high-performing dense retrievers in our setup. The availability of reliable methods for selecting dense retrieval models in zero-shot settings, without requiring the collection of labels for evaluation, would help streamline the widespread adoption of dense retrieval. This is therefore an important new problem that we believe the information retrieval community should consider. Implementations of the methods, along with raw result files and analysis scripts, are made publicly available at https://github.com/ielab/dr-model-selection.

Vertical Allocation-based Fair Exposure Amortizing in Ranking

Result ranking often affects consumer satisfaction as well as the amount of exposure each item receives in ranking services. Myopically maximizing customer satisfaction by ranking items only according to relevance leads to an unfair distribution of exposure across items, and in turn to unfair opportunities and economic gains for item producers/providers. Such unfairness will force providers to leave the system and discourage new providers from joining. Eventually, fewer purchase options would be left for consumers, and the utilities of both consumers and providers would be harmed. Thus, maintaining a balance between ranking relevance and fairness is crucial for both parties. In this paper, we focus on exposure fairness in ranking services. We demonstrate that existing methods for amortized fairness optimization can be suboptimal in terms of the fairness-relevance trade-off because they fail to utilize prior knowledge about consumers. We further propose a novel algorithm, Vertical Allocation-based Fair Exposure Amortizing in Ranking, or VerFair, to reach a better balance between exposure fairness and ranking performance. Extensive experiments on three real-world datasets show that VerFair significantly outperforms state-of-the-art fair ranking algorithms in fairness-performance trade-offs at both the individual and the group level.

Automatic Feature Fairness in Recommendation via Adversaries

Fairness is a widely discussed topic in recommender systems, but its practical implementation faces challenges in defining sensitive features while maintaining recommendation accuracy. We propose feature fairness as the foundation for achieving equitable treatment across diverse groups defined by various feature combinations, which improves overall accuracy through balanced feature generalizability. We introduce unbiased feature learning through adversarial training, using adversarial perturbation to enhance feature representation. The adversaries improve model generalization for under-represented features. We adapt the adversaries automatically based on two forms of feature bias: the frequency and the combination variety of feature values. This allows us to dynamically adjust perturbation strengths and adversarial training weights. Stronger perturbations are applied to feature values with fewer combination varieties to improve generalization, while higher weights for low-frequency features address training imbalances. Our backbone model, AAFM, applies this adaptive adversarial perturbation to the widely applied Factorization Machine. In experiments, AAFM surpasses strong baselines in both fairness and accuracy measures. AAFM excels in providing item- and user-fairness for single- and multi-feature tasks, showcasing its versatility and scalability. To maintain good accuracy, we find that the adversarial perturbation must be well managed: during training, perturbations should not persist for too long and their strengths should decay.

SESSION: Session 6: Recommendation

Sequential Recommendation with User Evolving Preference Decomposition

Modeling user sequential behaviors has recently attracted increasing attention in the recommendation domain. Existing methods mostly assume coherent preferences within the same sequence. However, user personalities are volatile and easily changed, and there can be multiple mixed preferences underlying user behaviors. To solve this problem, we propose a novel sequential recommender model that decomposes and models users' independent preferences. To achieve this goal, we highlight three practical challenges arising from the inconsistent, evolving and uneven nature of user behaviors. To overcome these challenges in a unified framework, we introduce a reinforcement learning module to simulate the evolution of user preference. More specifically, the action aims to allocate each item into a sub-sequence or create a new one, according to how the previous items were decomposed and the time interval between successive behaviors. The reward is associated with the final loss of the learning objective, aiming to generate sub-sequences that better fit the training data. We conduct extensive experiments based on eight real-world datasets across different domains. Compared with state-of-the-art methods, empirical studies show that our model improves performance on average by about 9.68%, 12.4%, 8.56% and 7.13% in Precision, Recall, NDCG and MRR, respectively.

Multi-Behavior Job Recommendation with Dynamic Availability

In recent years, a large number of job postings have appeared on the Internet, providing more diverse job opportunities. As a result, it is becoming more and more difficult for job seekers to find job postings relevant to their preferences, and job recommendation plays an important role in reducing the burden of job searching. Generally, job postings have a publication period, for example, 30 days, and they expire once the positions are filled. As a result, job seekers may be frustrated when they encounter such postings, since they can no longer apply for the positions. This indicates that job seekers may have strong preferences for job postings even when their application behaviors cannot be observed. This kind of gap has not been investigated in the multi-behavior recommendation literature. Therefore, in this work, we propose a new job recommendation model, called Multi-Behavior Job Recommendation with Dynamic Availability (MBJ-DA), which takes into account (1) auxiliary behaviors other than the application behavior and (2) the influence of the dynamic availability of job postings. MBJ-DA enables a more accurate estimation of each user’s actual preferences by explicitly distinguishing the noise potentially inherent in auxiliary behaviors. Furthermore, by explicitly considering the influence of the dynamic availability of job postings, MBJ-DA can mitigate the resulting biases and estimate each user’s actual preferences more accurately. Experimental results on our dataset, constructed from an actual job search website, show that MBJ-DA outperforms several state-of-the-art methods in terms of MRR and nDCG.

AdaReX: Cross-Domain, Adaptive, and Explainable Recommender System

Explainability is an inherent issue of recommender systems and has received much attention recently. Generative explainable recommendation, which provides personalized explanations by generating textual rationales, is emerging as an effective solution. Despite promising results, current methods are limited by their reliance on dense training data, which hinders the generalizability of explainable recommender systems. Our work tackles the novel problem of cross-domain explainable recommendation, aiming to extend the generalizability of explainable recommender systems. To this end, we propose a novel approach that models aspects extracted from past reviews, empowering explainable recommender systems by leveraging knowledge from other domains. Specifically, we propose AdaReX (Adaptive eXplainable Recommendation) to model the auxiliary and target domains simultaneously. By performing specific tasks in the respective domains and interconnecting them via a discriminator model, AdaReX allows the aspect sequences to learn common knowledge across different domains and tasks. Furthermore, through our proposed optimization objective, the learning of aspect sequences interacts deeply with the latent factors of in-domain users and items, enhancing the sharing of knowledge between domains. Our extensive experiments on real datasets demonstrate that our approach not only generates better explanations and recommendations for sparse users but also improves performance for general users.

Reinforcement Re-ranking with 2D Grid-based Recommendation Panels

Modern recommender systems usually present items as a streaming, one-dimensional ranking list. Recently, there is a trend in e-commerce of organizing recommended items into two-dimensional grid-based panels, where users can view the items in both the vertical and horizontal directions. Presenting items in grid-based result panels poses new challenges to recommender systems because existing models are designed to output sequential lists, while the slots in a grid-based panel have no explicit order. Directly converting item rankings into grids (e.g., pre-defining an order on the slots) overlooks user-specific behavioral patterns on grid-based panels and inevitably hurts the user experience. To address this issue, we propose a novel Markov decision process (MDP) to place items in 2D grid-based result panels at the final re-ranking stage of the recommender system. The model, referred to as Panel-MDP, takes an initial item ranking from the early stages as input. It defines the MDP's discrete time steps as the ranks in the initial ranking list, and the actions as the prediction of the user-item preference and the selection of a slot. At each time step, Panel-MDP sequentially executes two sub-actions: first deciding whether the current item in the initial ranking list is preferred by the user, and then selecting a slot for placing the item if it is preferred, or skipping the item otherwise. The process continues until all of the panel slots are filled. The PPO reinforcement learning algorithm is employed to learn the parameters of Panel-MDP. Simulations and experiments on a dataset collected from a widely used e-commerce app demonstrate the superiority of Panel-MDP in recommending 2D grid-based result panels.

SE-PEF: a Resource for Personalized Expert Finding

The problem of personalization in Information Retrieval has been under study for a long time. A well-known issue related to this task is the lack of publicly available datasets that support a comparative evaluation of personalized search systems. To contribute in this respect, this paper introduces SE-PEF (StackExchange - Personalized Expert Finding), a resource for designing and evaluating personalized models for the Expert Finding (EF) task. The dataset includes more than 250k queries and 565k answers from 3,306 experts, annotated with a rich set of features modeling the social interactions among the users of a popular cQA platform. The results of preliminary experiments show the suitability of SE-PEF for training and evaluating effective EF models.

SESSION: Tutorials

Recent Advances in Generative Information Retrieval

Generative retrieval (GR) has become a highly active area of information retrieval (IR) that has witnessed significant growth recently. Compared to the traditional “index-retrieve-then-rank” pipeline, the GR paradigm aims to consolidate all information within a corpus into a single model. Typically, a sequence-to-sequence model is trained to directly map a query to its relevant document identifiers (i.e., docids). This tutorial offers an introduction to the core concepts of the GR paradigm and a comprehensive overview of recent advances in its foundations and applications. We start by providing preliminary information covering foundational aspects and problem formulations of GR. Then, our focus shifts towards recent progress in docid design, training approaches, inference strategies, and the applications of GR. We end by outlining remaining challenges and issuing a call for future GR research. This tutorial is intended to be beneficial to both researchers and industry practitioners interested in developing novel GR solutions or applying them in real-world scenarios.

Rethinking Conversational Agents in the Era of LLMs: Proactivity, Non-collaborativity, and Beyond

Conversational systems are designed to offer human users social support or functional services through natural language interactions. Typical conversational research mainly focuses on the system's ability to respond, such as dialogue context understanding and response generation. In the era of large language models (LLMs), LLM-augmented conversational systems showcase exceptional capabilities in responding to user queries across different language tasks. However, as LLMs are trained to follow users' instructions, LLM-augmented conversational systems typically overlook the design of an essential property of intelligent conversations, i.e., goal awareness. In this tutorial, we will introduce recent advances in the design of agents' goal awareness in a wide range of conversational systems, including proactive, non-collaborative, and multi-goal conversational systems. In addition, we will discuss the main open challenges in developing agents' goal awareness in LLM-augmented conversational systems and several potential research directions for future studies.

User Simulation for Evaluating Information Access Systems

With the emergence of various information access systems exhibiting increasing complexity, there is a critical need for sound and scalable means of automatic evaluation. To address this challenge, user simulation emerges as a promising solution. This half-day tutorial focuses on providing a thorough understanding of user simulation techniques designed specifically for evaluation purposes. We systematically review major research progress, covering both general frameworks for designing user simulators, and specific models and algorithms for simulating user interactions with search engines, recommender systems, and conversational assistants. We also highlight some important future research directions.

Large Language Models for Recommendation: Progresses and Future Directions

The powerful large language models (LLMs) have played a pivotal role in advancing recommender systems. Recently, in both academia and industry, there has been a surge of interest in developing LLMs for recommendation, referred to as LLM4Rec. This includes endeavors like leveraging LLMs for generative item retrieval and ranking, as well as the exciting possibility of building universal LLMs for diverse open-ended recommendation tasks. These developments hold the potential to reshape the traditional recommender paradigm, paving the way for the next-generation recommender systems.

In this tutorial, we aim to trace the evolution of LLM4Rec and conduct a comprehensive review of existing research. In particular, we will clarify how recommender systems benefit from LLMs through a variety of perspectives, including the model architecture, the learning paradigm, and the strong abilities of LLMs such as chatting, generalization, planning, and generation. Furthermore, we will discuss the critical challenges and open problems in this emerging field, for instance, trustworthiness, efficiency, and model retraining issues. Lastly, we will summarize the implications of previous work and outline future research directions. We believe that this tutorial will help the audience better understand the progress and prospects of LLM4Rec and inspire future exploration. This, in turn, will drive progress in LLM4Rec, possibly fostering a paradigm shift in recommender systems.