Fernando Diaz
Carnegie Mellon University, USA
Recent advances in AI have heightened attention on the foundations of evaluation. As models become more performant, traditional metrics and benchmarks increasingly fail to capture meaningful differences in system behavior. Indeed, Voorhees et al. observe that modern retrieval models have saturated high-precision metrics, calling for “new strategies and tools for building reliable test collections.” I describe preference-based evaluation, a framework that reinterprets evaluation as an ordering over system behaviors rather than the computation of numeric scores. Although preference judgments are common in laboratory studies and online evaluation, automatic evaluation metrics such as average precision and reciprocal rank have traditionally lacked preference-based counterparts. Drawing on foundational work in information retrieval evaluation and social-choice theory, I introduce a family of methods for conducting efficient, automatic, preference-based evaluation. Through a series of experiments across retrieval and recommendation tasks, preference-based versions of precision, recall, and average precision all demonstrate substantially higher sensitivity, addressing recent trends of metric saturation.
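To make the contrast with score-based evaluation concrete, the minimal sketch below aggregates per-query pairwise preferences between systems into an overall ordering, in the spirit of pairwise-majority (Copeland-style) aggregation from social-choice theory. The system names, queries, preference data, and the Copeland-style rule are illustrative assumptions for this sketch, not the specific estimators developed in the talk.

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-query preference judgments: for each query, which of the
# two systems in each (ordered) pair is preferred. In practice these could
# come from assessors, interleaving, or an automatic preference method.
preferences = {
    "q1": {("sysA", "sysB"): "sysA", ("sysA", "sysC"): "sysA", ("sysB", "sysC"): "sysC"},
    "q2": {("sysA", "sysB"): "sysB", ("sysA", "sysC"): "sysA", ("sysB", "sysC"): "sysB"},
    "q3": {("sysA", "sysB"): "sysA", ("sysA", "sysC"): "sysC", ("sysB", "sysC"): "sysB"},
}
systems = ["sysA", "sysB", "sysC"]

# Count, for each pair of systems, how often each member is preferred.
pairwise_wins = Counter()
for per_query in preferences.values():
    for pair, winner in per_query.items():
        pairwise_wins[(pair, winner)] += 1

# Copeland-style aggregation: a system earns a point for every pairwise
# majority it wins; the output is an ordering over systems rather than a
# score on an absolute scale.
copeland = Counter({s: 0 for s in systems})
for a, b in combinations(systems, 2):
    if pairwise_wins[((a, b), a)] > pairwise_wins[((a, b), b)]:
        copeland[a] += 1
    elif pairwise_wins[((a, b), b)] > pairwise_wins[((a, b), a)]:
        copeland[b] += 1

ordering = sorted(systems, key=lambda s: -copeland[s])
print(ordering)  # ['sysA', 'sysB', 'sysC'] for the data above
```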
Min Zhang
Tsinghua University, China
As research into LLMs deepens, it is timely to examine the similarities and differences between LLMs and human users. This talk addresses several questions from a user-centric viewpoint in information access tasks: How can we evaluate the performance of large models, and what is their efficacy? To what extent do LLMs' conversational behaviors differ from those of humans in IR tasks? How does their capacity for test-time learning from conversational reasoning experiences compare with that of humans? Some of our recent explorations and findings on these questions will also be presented. I hope that discussions on these topics will offer new perspectives and inspire future research into the behavior and reasoning mechanisms of LLMs in information access tasks.