The Third Workshop on Large Language Models (LLMs) for Evaluation in Information Retrieval
The Third Workshop on LLM4Eval continues the discussion begun in the earlier editions of the series, which investigated the potential and challenges of using LLMs for search relevance evaluation, automated judgments, and retrieval-augmented generation (RAG) assessment. As modern IR systems integrate search, recommendations, conversational interfaces, and personalization, new evaluation challenges arise that go beyond basic relevance assessment. These applications generate personalized rankings and explanations and adapt to user preferences over time, requiring new evaluation methods. While LLMs can effectively generate relevance judgments, they struggle to assess subjective aspects of IR systems, such as interaction quality, explanation effectiveness, and trustworthiness, which often still require human judgment. The main goal of the third LLM4Eval workshop is to bring together researchers from industry and academia to explore three critical areas: the evaluation of personalized IR systems while maintaining fairness, the boundaries between automated and human assessment in subjective scenarios, and evaluation methodologies for systems that combine multiple IR paradigms (search, recommendations, and dialogue).