Accepted Papers
- One-Shot Labeling for Automatic Relevance Estimation
  - Sean MacAvaney and Luca Soldaini
- Evaluating Cross-modal Generative Models Using Retrieval Task
  - Shivangi Bithel and Srikanta Bedathur
- A Comparison of Methods for Evaluating Generative IR
  - Negar Arabzadeh and Charles L. A. Clarke
- A Novel Evaluation Framework for Image2Text Generation
  - Jia-Hong Huang, Hongyi Zhu, Yixian Shen, Stevan Rudinac, Alessio M. Pacces and Evangelos Kanoulas
- Using LLMs to Investigate Correlations of Conversational Follow-up Queries with User Satisfaction
  - Hyunwoo Kim, Yoonseo Choi, Taehyun Yang, Honggu Lee, Chaneon Park, Yongju Lee, Jin Young Kim and Juho Kim
- EXAM++: LLM-based Answerability Metrics for IR Evaluation
  - Naghmeh Farzi and Laura Dietz
- Context Does Matter: Implications for Crowdsourced Evaluation Labels in Task-Oriented Dialogue Systems
  - Clemencia Siro, Mohammad Aliannejadi and Maarten de Rijke
- On the Evaluation of Machine-Generated Reports
  - James Mayfield, Eugene Yang, Dawn Lawrie, Sean MacAvaney, Paul McNamee, Douglas W. Oard, Luca Soldaini, Ian Soboroff, Orion Weller, Efsun Kayi, Kate Sanders, Marc Mason and Noah Hibbler
- Toward Automatic Relevance Judgment using Vision–Language Models for Image–Text Retrieval Evaluation
  - Jheng-Hong Yang and Jimmy Lin
- Reliable Confidence Intervals for Information Retrieval Evaluation Using Generative A.I.
  - Harrie Oosterhuis, Rolf Jagerman, Zhen Qin, Xuanhui Wang and Michael Bendersky
- Selective Fine-tuning on LLM-labeled Data May Reduce Reliance on Human Annotation: A Case Study Using Schedule-of-Event Table Detection
  - Bhawesh Kumar, Jonathan Amar, Eric Yang, Nan Li and Yugang Jia
- FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions
  - Orion Weller, Benjamin Chang, Sean MacAvaney, Kyle Lo, Arman Cohan, Benjamin Van Durme, Dawn Lawrie and Luca Soldaini
- The Challenges of Evaluating LLM Applications: An Analysis of Automated, Human, and LLM-Based Approaches
  - Bhashithe Abeysinghe and Ruhan Circi
- Can We Use Large Language Models to Fill Relevance Judgment Holes?
  - Zahra Abbasiantaeb, Chuan Meng, Leif Azzopardi and Mohammad Aliannejadi
- Query Performance Prediction using Relevance Judgments Generated by Large Language Model
  - Chuan Meng, Negar Arabzadeh, Arian Askari, Mohammad Aliannejadi and Maarten de Rijke
- Large Language Models for Relevance Judgment in Product Search
  - Navid Mehrdad, Hrushikesh Mohapatra, Mossaab Bagdouri, Prijith Chandran, Alessandro Magnani, Xunfan Cai, Ajit Puthenputhussery, Sachin Yadav, Tony Lee, Chengxiang Zhai and Ciya Liao
- Evaluating the Retrieval Component in LLM-Based Question Answering Systems
  - Ashkan Alinejad, Krtin Kumar and Ali Vahdat
- Exploring Large Language Models for Relevance Judgments in Tetun
  - Gabriel de Jesus and Sérgio Nunes
- A Comparative Analysis of Faithfulness Metrics and Humans in Citation Evaluation
  - Weijia Zhang, Mohammad Aliannejadi, Jiahuan Pei, Yifei Yuan, Jia-Hong Huang and Evangelos Kanoulas
- Evaluating RAG-Fusion with RAGElo: an Automated Elo-based Framework
  - Zackary Rackauckas, Arthur Câmara and Jakub Zavrel
- Enhancing Demographic Diversity in Test Collections Using LLMs
  - Marwah Alaofi, Nicola Ferro, Paul Thomas, Falk Scholer and Mark Sanderson
- GPT-4 Relevance Labelling can be Fooled by Query Keyword Stuffing
  - Marwah Alaofi, Paul Thomas, Falk Scholer and Mark Sanderson