GlotEval-HumanEval documentation

Introduction

Evaluating large language models (LLMs), especially on low-resource languages, remains challenging because existing benchmarks are fragmented and focus mostly on high-resource languages. GlotEval addresses this with an automatic evaluation toolkit: a suite of tools for standardized LLM performance assessment that supports custom evaluation pipelines across multiple tasks, benchmarks, and metrics.

GlotEval-HumanEval is the GlotEval component that presents the toolkit's results in a user-friendly interface, supporting side-by-side (horizontal) comparison through charts as well as human evaluation of model outputs.
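As a rough illustration of what a horizontal comparison involves, the sketch below groups per-language scores for one benchmark so that models can be read side by side. It is not GlotEval-HumanEval's actual code: the record layout, the model names, the language codes, the scores, and the helper functions `pivot_by_language` and `best_model_per_language` are all hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical records: (model, benchmark, language, score).
# Model names and scores are placeholders, not real results.
records = [
    ("model-a", "SIB-200", "fin_Latn", 0.50),
    ("model-b", "SIB-200", "fin_Latn", 0.75),
    ("model-a", "SIB-200", "swh_Latn", 0.25),
    ("model-b", "SIB-200", "swh_Latn", 0.50),
]


def pivot_by_language(rows, benchmark):
    """Group scores for one benchmark as {language: {model: score}}."""
    table = defaultdict(dict)
    for model, bench, language, score in rows:
        if bench == benchmark:
            table[language][model] = score
    return dict(table)


def best_model_per_language(table):
    """Return {language: name of the model with the highest score}."""
    return {lang: max(scores, key=scores.get) for lang, scores in table.items()}


table = pivot_by_language(records, "SIB-200")

# Print one row per language with all models next to each other.
for language, scores in sorted(table.items()):
    row = ", ".join(f"{m}={s:.2f}" for m, s in sorted(scores.items()))
    print(f"{language}: {row}")
```

A table shaped like this maps directly onto a grouped bar chart (one group per language, one bar per model), which is one plausible way such a comparison could be rendered.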

Supported tasks and benchmarks

  • Text Classification: SIB-200, Taxi-1500

  • Machine Translation: Flores200

  • Summarization: XL-Sum

  • Open-ended Generation: Aya, PolyWrite

  • Machine Comprehension: BELEBELE, arc_multilingual

  • Intrinsic Evaluation: glot500, pbc

🌟Key Features

Compare the results across benchmarks and languages

_images/data-visualization.png

Comparative Metrics

_images/Praneeth.png

View the output results

_images/datatable-translated.png

Give feedback on the output results

_images/Eval1by1.png

Frameworks

We are using the following frameworks and libraries:

  • React

  • Material UI

  • Python Flask

  • SQLite