Program

LLM4Eval is co-located with SIGIR 2024 in Washington, D.C., USA, and takes place on July 18, 2024. All times in the table below are in the local time zone.

Time Agenda
9:00 - 9:15 Opening Remarks
9:15 - 10:00 Keynote 1: Ian Soboroff, NIST
10:00 - 10:30 Booster Talks 1
10:30 - 11:00 Coffee Break
11:00 - 11:30 Booster Talks 2
11:30 - 12:30 Poster Session
12:30 - 13:30 Lunch
13:30 - 14:15 Keynote 2: Donald Metzler, Google
14:15 - 14:30 LLMJudge Presentation
14:30 - 14:40 Discussion Kickoff
14:40 - 15:00 Breakout Discussions
15:00 - 15:30 Coffee Break
15:30 - 15:50 Breakout Discussion + Shuffling
15:50 - 16:00 Breakout Discussion Summary
16:00 - 16:55 Panel Discussion
16:55 - 17:00 Closing

Keynotes

A Brief History of Automatic Evaluation in IR

Ian Soboroff, National Institute of Standards and Technology (NIST)

Abstract. The ability of large language models such as GPT-4 to respond to natural language instructions with flowing, grammatical text that reflects world knowledge has generated (sorry) significant interest in IR, as it has everywhere, and specifically in the area of IR evaluation. It seems that just as we “prompt” a human assessor to provide a relevance judgment, we can do the same thing with an LLM. Researchers are very excited because the fluent, concise, informed, and perhaps even grounded responses from the LLM feel like interacting with a person, and so we guess they might have some of the same capabilities beyond producing fluent textual responses to prompts. In IR we are always complaining about the costs of human assessments, so perhaps that problem is solved. I would like to point out, although it is not the main thrust of this talk, that if the above is true, IR is solved and we don’t need research on it any more. The computer understands the document and the user’s information need to the degree that it can accurately predict whether the document meets the need, and that is what IR systems are supposed to do. Scaling current LLM capabilities to the point where they can run on your wristwatch is just engineering. The actual thrust of this talk will be to review some of the history and literature on automatic evaluation methods. This is not automatic evaluation’s first rodeo, as they say. My arrival at NIST was accompanied by a SIGIR paper proposing that relevant documents could be picked using random sampling, and from that point the race was on. Along the way we have reinforced some things we already knew, such as that relevance feedback is good, and found some new things we didn’t know.

Bio. Ian Soboroff is a computer scientist and leader of the Retrieval Group at the National Institute of Standards and Technology (NIST). The Retrieval Group organizes the Text REtrieval Conference (TREC), the Text Analysis Conference (TAC), and the TREC Video Retrieval Evaluation (TRECVID). These are all large, community-based research workshops that drive the state of the art in information retrieval, video search, web search, information extraction, text summarization, and other areas of information access. He has co-authored many publications on information retrieval evaluation, test collection building, text filtering, collaborative filtering, and intelligent software agents. His current research interests include building test collections for social media environments and non-traditional retrieval tasks.

LLMs as Rankers, Raters, and Rewarders

Donald Metzler, Google DeepMind

Abstract. In this talk, I will discuss recent advancements in the application of large language models (LLMs) to ranking, rating, and reward modeling, particularly in the context of information retrieval tasks. I will emphasize the fundamental similarities among these problems, highlighting that they essentially address the same underlying issue through different approaches. Based on this observation, I will propose several research questions that offer promising avenues for future exploration.

Bio. Donald Metzler is a Senior Staff Research Scientist at Google Inc. Prior to that, he was a Research Assistant Professor at the University of Southern California (USC) and a Senior Research Scientist at Yahoo!. He has served as the Program Chair of the WSDM, ICTIR, and OAIR conferences and has served on the editorial boards of all the major journals in his field. He has published over 100 research papers, has been awarded 9 patents, and is a co-author of “Search Engines: Information Retrieval in Practice”. He currently leads a research group focused on a variety of problems at the intersection of machine learning, natural language processing, and information retrieval.

Panelists

Charles Clarke, University of Waterloo

Ian Soboroff, National Institute of Standards and Technology (NIST)

Laura Dietz, University of New Hampshire

Michael Ekstrand, Drexel University