LLMs for Relevance Judgments
The LLMJudge challenge is organized as part of the LLM4Eval workshop at SIGIR 2024. Test collections are essential for evaluating information retrieval (IR) systems. Evaluating and tuning a search system largely depends on relevance labels, which indicate whether a document is useful for a specific search and user. However, collecting relevance judgments on a large scale is costly and resource-intensive. Consequently, typical experiments rely on third-party labellers who may not always produce accurate annotations. The LLMJudge challenge aims to explore an alternative approach by using LLMs to generate relevance judgments. Recent studies have shown that LLMs can generate reliable relevance judgments for search systems. Nevertheless, it remains unclear which LLMs can match the accuracy of human labellers, which prompts are most effective, how fine-tuned open-source LLMs compare to closed-source LLMs like GPT-4, whether there are biases in synthetically generated data, and if data leakage affects the quality of generated labels. This challenge will investigate these questions, and the collected data will be released as a package to support automatic relevance judgment research in information retrieval and search.
The challenge asks: given a query and a document as input, how relevant is the document to the query? Relevance is judged on a four-point scale.
The task is as follows: participants are provided with datasets that include queries, documents, and query-document pair files, and must use LLMs to generate a score in [0, 1, 2, 3]
indicating the relevance of each document to its query.
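As an illustration, the sketch below shows one possible way to obtain such a score from an LLM. It is a minimal sketch only: the model name, the prompt wording, the judge helper function, and the answer-parsing logic are illustrative assumptions, not part of the challenge specification.

```python
# Minimal sketch: ask an LLM for a 0-3 relevance label for one query-document pair.
# The model name, prompt wording, and parsing below are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

PROMPT = (
    "Given a query and a passage, judge how relevant the passage is to the query "
    "on a four-point scale from 0 (irrelevant) to 3 (perfectly relevant). "
    "Answer with a single digit.\n\n"
    "Query: {query}\nPassage: {passage}\nLabel:"
)

def judge(query: str, passage: str, model: str = "gpt-4o") -> int:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(query=query, passage=passage)}],
        temperature=0,
    )
    answer = response.choices[0].message.content.strip()
    digits = [c for c in answer if c in "0123"]
    return int(digits[0]) if digits else 0  # fall back to 0 if the answer is unparseable
```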
Below we list the files for the challenge:
./data/llm4eval_document_2024.jsonl
is a JSONL file consisting of document IDs and documents (passages).
./data/llm4eval_query_2024.txt
is a TXT file consisting of topic IDs and queries (text).
./data/llm4eval_dev_qrel_2024.txt
is a TXT file consisting of train/dev topic IDs and document IDs with relevance labels. It can be used for training, fine-tuning, or in-context learning.
./data/llm4eval_test_qrel_2024.txt
is a TXT file consisting of test topic IDs and document IDs. This is the file participants must use to predict the relevance of each query-document pair and submit as the final result file.
Note that all TXT files are tab-separated.
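For convenience, here is a rough sketch of how these files could be read in Python. The JSONL field names ("docid", "doc") and the exact qrel column layout are assumptions and should be verified against the released data.

```python
# Sketch of loading the challenge files. The JSONL field names ("docid", "doc")
# and the TREC-style qrel layout (qid 0 docid [label]) are assumptions.
import json

def load_documents(path="./data/llm4eval_document_2024.jsonl"):
    docs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            docs[record["docid"]] = record["doc"]
    return docs

def load_queries(path="./data/llm4eval_query_2024.txt"):
    queries = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            qid, text = line.rstrip("\n").split("\t", 1)
            queries[qid] = text
    return queries

def load_qrels(path="./data/llm4eval_dev_qrel_2024.txt", has_label=True):
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()  # tab-separated: qid 0 docid [label]
            qid, docid = parts[0], parts[2]
            label = int(parts[3]) if has_label and len(parts) > 3 else None
            rows.append((qid, docid, label))
    return rows
```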
Participants’ results will then be compared in two ways after submission:
We will use Google Forms for submissions. Submissions are open at https://forms.gle/SmbW5nYZ89gowBN17. The form walks you through the required submission files step by step. Please do not hesitate to contact us in case of questions and/or problems. The final results file should be formatted like the challenge test file, with one extra column for the LLM-generated label of each sample:
qid 0 docid label
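As a sketch, assuming the predicted labels are stored in a dictionary keyed by (query ID, document ID), the results file could be written as follows; the output file name is arbitrary, and the tab-separated column layout mirrors the test file described above.

```python
# Sketch: append an LLM-generated label to each line of the test qrel file
# and write the result in the same tab-separated layout (qid, 0, docid, label).
def write_submission(test_qrel_path, predictions, out_path="llm4eval_submission.txt"):
    """predictions: dict mapping (qid, docid) -> label in {0, 1, 2, 3}."""
    with open(test_qrel_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            qid, zero, docid = line.split()[:3]
            label = predictions[(qid, docid)]
            fout.write(f"{qid}\t{zero}\t{docid}\t{label}\n")
```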
1. (Model Usage) Are we allowed to use different models for our submissions, or is there a specific model that we must use?
You can use any models you prefer. There are no restrictions on the models you can use for your submissions.
2. (Prompt Flexibility) Can we change or customise the provided prompt, or must we use the exact prompts as specified in the challenge guidelines?
You can use any prompts you wish. The provided prompt is merely a sample from the paper by Thomas et al. [1].
[1] Thomas, Paul, Seth Spielman, Nick Craswell, and Bhaskar Mitra. “Large language models can accurately predict searcher preferences.” arXiv preprint arXiv:2309.10621 (2023).
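Because prompts are unrestricted, one possible customisation is to include a few labelled examples from the dev qrels as in-context demonstrations. The sketch below is purely illustrative, and is not the prompt from Thomas et al. [1]; the helper name and formatting are assumptions.

```python
# Sketch of a few-shot prompt built from dev-qrel examples; purely illustrative.
def build_few_shot_prompt(examples, query, passage):
    """examples: list of (query, passage, label) tuples drawn from the dev qrels."""
    parts = [
        "Judge how relevant the passage is to the query on a scale from "
        "0 (irrelevant) to 3 (perfectly relevant). Answer with a single digit."
    ]
    for ex_query, ex_passage, ex_label in examples:
        parts.append(f"Query: {ex_query}\nPassage: {ex_passage}\nLabel: {ex_label}")
    parts.append(f"Query: {query}\nPassage: {passage}\nLabel:")
    return "\n\n".join(parts)
```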
3. (Submission Limits) How many runs are we permitted to submit for evaluation? Is there a limit to the number of submissions per team or model?
There are no limits on the number of submissions. You can submit as many runs as you want. We also encourage running and submitting the same model and prompt multiple times for reproducibility purposes.
4. (Query Format Clarification) Regarding the format qid 0 docid, we understand that the 0 serves as a separator. Can you please confirm if this is correct?
Yes, the 0 serves as a separator.
5. (Challenge Goal) I am a bit unclear about the main goal of the challenge. Is it about finding the best evaluation prompt, finding the best public LLM model, fine-tuning an LLM model for judging purposes, or all of the above? Clarity here will help in the experiment setup.
The main goal of the challenge encompasses all of the above: finding the best evaluation prompt, identifying the best public LLM model, and fine-tuning an LLM model for judging purposes. Participants are encouraged to explore and innovate across these aspects.
6. (Reproducibility Factors) To ensure reproducibility and comparison between different submissions, what are the set of factors which should be kept unchanged?
To ensure reproducibility and facilitate comparison between different submissions, the following factors should be kept unchanged:
7. (Competition and Winners) Does the challenge have a contest i.e., will there be winners announced at the end of the competition?
Yes, the top-performing teams/submissions will be announced at the workshop.
The challenge is organized as a joint effort by University College London, Microsoft, the University of Amsterdam, the University of Waterloo, and the University of Padua.