GlotEval-HumanEval documentation

Introduction

Evaluating large language models, especially for low-resource languages, remains challenging due to fragmented benchmarks focused on high-resource languages. GlotEval addresses this by offering an Automatic Evaluation Toolkit, a suite of tools for standardized LLM performance assessment that supports custom evaluation pipelines across multiple tasks, benchmarks, and metrics.

GlotEval-HumanEval is a component of GlotEval that presents toolkit results through a user-friendly interface, enabling horizontal comparisons via charts and human evaluations.