GlotEval-HumanEval documentation
Introduction
Evaluating large language models (LLMs), especially on low-resource languages, remains challenging because existing benchmarks are fragmented and focus mostly on high-resource languages. GlotEval addresses this with an Automatic Evaluation Toolkit: a suite of tools for standardized LLM performance assessment that supports custom evaluation pipelines across multiple tasks, benchmarks, and metrics.
GlotEval-HumanEval is the GlotEval component that presents the toolkit's results in a user-friendly interface. It enables horizontal comparisons through charts and supports human evaluation of model outputs.
Supported tasks and benchmarks
Text Classification: SIB-200, Taxi-1500
Machine Translation: Flores200
Summarization: XL-Sum
Open-ended Generation: Aya, PolyWrite
Machine Comprehension: BELEBELE, arc_multilingual
Intrinsic Evaluation: glot500, pbc
🌟 Key Features
Compare results across benchmarks and languages
Compare metrics
View the output results
Give feedback on the output results
Frameworks
We use the following frameworks and libraries:
React
Material UI
Python Flask
SQLite
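To make the role of SQLite in this stack more concrete, here is a minimal sketch of how evaluation results and human feedback could be stored and queried for a horizontal comparison. It uses only Python's standard `sqlite3` module. The table names, column names, and helper functions (`open_db`, `add_result`, `add_feedback`, `compare`, `mean_rating`) are hypothetical and chosen for illustration; they are not the actual GlotEval-HumanEval schema or API, and the scores in the demo are placeholders rather than real results.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions,
# not the actual GlotEval-HumanEval database layout.
SCHEMA = """
CREATE TABLE IF NOT EXISTS results (
    id        INTEGER PRIMARY KEY,
    model     TEXT NOT NULL,
    benchmark TEXT NOT NULL,
    language  TEXT NOT NULL,
    metric    TEXT NOT NULL,
    score     REAL NOT NULL
);
CREATE TABLE IF NOT EXISTS feedback (
    id        INTEGER PRIMARY KEY,
    result_id INTEGER NOT NULL REFERENCES results(id),
    rating    INTEGER NOT NULL CHECK (rating BETWEEN 1 AND 5),
    comment   TEXT
);
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure both tables exist."""
    conn = sqlite3.connect(path)
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def add_result(conn, model, benchmark, language, metric, score):
    """Insert one evaluation result and return its row id."""
    cur = conn.execute(
        "INSERT INTO results (model, benchmark, language, metric, score) "
        "VALUES (?, ?, ?, ?, ?)",
        (model, benchmark, language, metric, score),
    )
    conn.commit()
    return cur.lastrowid


def add_feedback(conn, result_id, rating, comment=None):
    """Attach a human rating (1 to 5) and an optional comment to a result."""
    conn.execute(
        "INSERT INTO feedback (result_id, rating, comment) VALUES (?, ?, ?)",
        (result_id, rating, comment),
    )
    conn.commit()


def compare(conn, benchmark, metric):
    """Return {language: {model: score}} for one benchmark and metric."""
    rows = conn.execute(
        "SELECT language, model, score FROM results "
        "WHERE benchmark = ? AND metric = ? ORDER BY language, model",
        (benchmark, metric),
    )
    table = {}
    for language, model, score in rows:
        table.setdefault(language, {})[model] = score
    return table


def mean_rating(conn, result_id):
    """Average human rating for a result, or None if it has no feedback."""
    row = conn.execute(
        "SELECT AVG(rating) FROM feedback WHERE result_id = ?", (result_id,)
    ).fetchone()
    return row[0]


if __name__ == "__main__":
    conn = open_db()
    # Placeholder models and scores, for demonstration only.
    rid = add_result(conn, "model-a", "SIB-200", "eng_Latn", "accuracy", 0.5)
    add_result(conn, "model-b", "SIB-200", "eng_Latn", "accuracy", 0.25)
    add_feedback(conn, rid, 4, "Mostly correct labels")
    print(compare(conn, "SIB-200", "accuracy"))
    print(mean_rating(conn, rid))
```

In a setup like this, a Flask route could return the dictionary from `compare` as JSON for the React front end to chart, but how the real application wires these layers together is not specified here.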
User Guide
System Designs