1. Choose items to be evaluated (scroll).
Budget: 5 items left to select.
2. Models translate selected items.
3. Humans assess translation quality.
4. See the model ranking on your selection.
Click reveal to compare it to the true model ranking on all items.
Finished? You have just simulated, on the real WMT 2024 English→Czech dataset, what the subset2evaluate tool does!
The goal is to find a subset of items that ranks the models the same way as the full evaluation would.
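This goal can be made concrete with a minimal, self-contained sketch. It is not the tool's actual selection algorithm: the synthetic per-item scores and the random 5-item subset here are illustrative stand-ins for human judgments and for a selection method. Each model is ranked by its mean score on the full set and on a subset, and the two rankings are compared with Kendall's tau (1.0 means the subset reproduces the full ranking exactly).

```python
import random

random.seed(0)
n_items = 100
models = ["A", "B", "C", "D"]
# Synthetic per-item quality scores (stand-in for human judgments);
# higher mean = better model, so D is best on average and A is worst.
scores = {m: [random.gauss(mu, 1.0) for _ in range(n_items)]
          for mu, m in enumerate(models)}

def ranking(items):
    """Models ordered best-first by mean score over the chosen items."""
    return sorted(models, key=lambda m: -sum(scores[m][i] for i in items) / len(items))

def kendall_tau(r1, r2):
    """Rank correlation between two orderings of the same models."""
    pairs = [(a, b) for i, a in enumerate(models) for b in models[i + 1:]]
    agree = sum((r1.index(a) < r1.index(b)) == (r2.index(a) < r2.index(b))
                for a, b in pairs)
    return 2 * agree / len(pairs) - 1

full_rank = ranking(range(n_items))
subset = random.sample(range(n_items), 5)  # the 5-item budget from the demo
print("full:", full_rank)
print("subset:", ranking(subset), "tau:", kendall_tau(full_rank, ranking(subset)))
```

A random subset often misranks the models; the point of subset selection is to pick the 5 items whose tau against the full ranking is as high as possible.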
Read the paper
How to Select Datapoints for Efficient Human Evaluation of NLG Models? or just use the
tool for your NLG evaluation.
Cite as:
@misc{zouhar2025selectdatapointsefficienthuman,
title={How to Select Datapoints for Efficient Human Evaluation of NLG Models?},
author={Vilém Zouhar and Peng Cui and Mrinmaya Sachan},
year={2025},
eprint={2501.18251},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.18251},
}