Building High-Quality AI Evaluation Datasets - A Practical Guide from Zero to One
Introduction #
Recently, while working on our company’s conversational data-analysis AI project, I ran into plenty of challenges that drove home the maxim “no testing, no development.” From my initial confusion on first encountering evaluation frameworks to now being able to build an evaluation system methodically, I have taken my share of detours.
This article will focus on sharing experiences in preparing evaluation datasets, hoping to help others exploring the AI evaluation field avoid unnecessary pitfalls. For detailed reviews of evaluation frameworks, please check out the evaluation series articles on this site.
Dataset Format Design #
For beginners, I strongly recommend using Excel to prepare datasets - it’s simple and efficient. The recommended format for a standard evaluation data entry is:
`input, output, metadata`
The specific content of these three fields varies with the evaluation objective. Taking our company’s data-analysis Q&A bot (think of it as a ChatGPT for data analysis) as an example, here is what each field holds when evaluating user conversation satisfaction:
- input: User questions, which may include contextual information. When context is included, it’s actually testing multi-turn conversation capabilities.
- output: The reference answer to the question, which may include analysis data obtained through tool calls.
- metadata: Relevant contextual information, such as language, data source, difficulty, and other label information.
Evaluators will score and analyze based on this information.
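Below is a minimal sketch of what one such entry might look like and how the Excel sheet could be loaded for an evaluation run. The file layout, the use of pandas, and storing metadata as a JSON string are illustrative assumptions, not a prescribed setup.

```python
# A minimal sketch of one evaluation entry and a loader for the Excel sheet.
# Assumptions (not prescribed by the article): pandas is available, the sheet
# has "input", "output", "metadata" columns, and metadata is a JSON string.
import json

import pandas as pd

sample_entry = {
    "input": "How much did Q1 sales grow compared to the same period last year?",
    "output": "<reference answer, possibly including analysis data from tool calls>",
    "metadata": {"language": "en", "source": "sales_db", "difficulty": "easy"},
}


def load_dataset(path: str) -> list[dict]:
    """Read the Excel sheet into a list of entry dicts for the evaluation harness."""
    df = pd.read_excel(path)
    entries = []
    for _, row in df.iterrows():
        entries.append(
            {
                "input": row["input"],
                "output": row["output"],
                # metadata is kept as one JSON column so the sheet stays at three columns
                "metadata": json.loads(row["metadata"]),
            }
        )
    return entries
```

Keeping metadata as a single JSON column means new labels (language, data source, difficulty, …) can be added later without changing the sheet layout.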
Data Collection Strategy #
At this stage, we mainly collect questions, i.e., input. Data collection generally occurs in two phases:
1. Development Phase #
Before the product launches, we collect data primarily through:
- Simulating potential user questions based on product requirement documents
- Having team members role-play as users asking questions
- Using AI to help design diverse test questions
2. Post-Launch Phase #
Once the product is live, we begin collecting data from real users:
- Extracting actual user questions from logs
- Focusing on questions with poor user feedback (a small collection sketch follows this list)
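Here is a hedged sketch of pulling candidate questions out of the logs. The JSON-lines log format and the `question`/`feedback` field names are assumptions made for illustration; adapt them to whatever your logging pipeline actually records.

```python
# A sketch of post-launch collection: pull user questions out of conversation
# logs and prioritize the ones with poor feedback. The JSON-lines log format
# and the "question"/"feedback" field names are illustrative assumptions.
import json


def collect_candidates(log_path: str, max_items: int = 50) -> list[dict]:
    candidates = []
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            # Keep questions the user rated poorly (e.g. a thumbs-down).
            if record.get("feedback") == "negative":
                candidates.append(
                    {"input": record["question"], "metadata": {"source": "user_log"}}
                )
            if len(candidates) >= max_items:
                break
    return candidates
```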
Data Classification System #
Data classification is crucial for comprehensive system evaluation. Based on our scenario, we divide test data into three major categories:
Answerable Questions: Questions for which corresponding analysis data can be obtained. Users should receive high-quality answers to these questions. For example:
"How much did Q1 sales grow compared to the same period last year?"
Boundary Questions: Questions without corresponding analysis data. The system should clearly explain the reason to users. For example:
"What is our company's sales forecast for 2030?" (No future data in the database)
Irrelevant Questions: Questions unrelated to data analysis. The system should recognize and respond appropriately. For example:
"How's the weather today?" "Can you write me a poem?"
In evaluation experiments, these classification details serve as important analytical dimensions.
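To make these categories usable as analysis dimensions, it helps to record them explicitly in each entry’s metadata. The enum below is an illustrative sketch using the category names from this article, not part of any particular evaluation framework.

```python
# An illustrative way to record the three categories in each entry's metadata
# so they can later be sliced as analysis dimensions.
from enum import Enum


class QuestionCategory(str, Enum):
    ANSWERABLE = "answerable"   # analysis data exists; expect a high-quality answer
    BOUNDARY = "boundary"       # no matching data; expect a clear explanation
    IRRELEVANT = "irrelevant"   # off-topic; expect a polite refusal or redirect


entry = {
    "input": "What is our company's sales forecast for 2030?",
    "output": "<reference answer explaining that no 2030 data exists in the database>",
    "metadata": {"category": QuestionCategory.BOUNDARY.value, "language": "en"},
}
```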
Data Usage Process #
The evaluation process typically follows these steps:
- Build evaluation tasks, sending prepared questions (input) to the AI agent
- Record the AI’s answers as actual output
- Also record the relevant data the AI obtained (in our case, backend data returned via tool-call APIs)
- Provide necessary contextual information based on different evaluator requirements
- Execute the evaluation to obtain scores and analysis reports
Different evaluation dimensions may require different output content. For example, when evaluating analytical accuracy, corresponding data is needed as context; when evaluating response fluency, the focus is more on text quality.
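The sketch below strings these steps together. The `agent.ask()` client and the `evaluate()` callback are hypothetical stand-ins for your own agent API and whichever evaluator (accuracy, fluency, …) you are running; neither is a real library interface.

```python
# A sketch of the five steps above. `agent` and `evaluate` are hypothetical:
# `agent.ask()` is assumed to return the answer text plus the data obtained
# through tool calls, and `evaluate()` stands in for the chosen evaluator.
def run_evaluation(entries: list[dict], agent, evaluate) -> list[dict]:
    results = []
    for entry in entries:
        # 1. Send the prepared question (input) to the AI agent.
        answer, tool_data = agent.ask(entry["input"])
        # 2-3. Record the actual output and the backend data from tool calls.
        record = {
            "input": entry["input"],
            "expected_output": entry["output"],
            "actual_output": answer,
            "tool_data": tool_data,
            "metadata": entry["metadata"],
        }
        # 4-5. Give the evaluator the context it needs and collect the score.
        record["score"] = evaluate(record)
        results.append(record)
    return results
```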
Data Volume Planning #
Data volume planning should be determined based on project stage and evaluation objectives:
| Project Stage | Recommended Data Volume | Description |
|---|---|---|
| Initial Development | 5-10 entries per category | Quick iteration, identifying obvious issues |
| Internal Testing | 20-50 entries per category | Comprehensive coverage of core scenarios |
| Official Release | 100+ entries | Including various edge cases |
In our experience, quality beats quantity: a small, well-designed dataset is often more valuable than a large, disorganized one.
Practical Tips #
- Data Diversity: Ensure the dataset covers questions of different difficulties, domains, and phrasings (see the coverage sketch after this list)
- Continuous Updates: Regularly update evaluation datasets based on user feedback and system updates
- Cross-Validation: Have multiple people review the dataset to avoid personal bias
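For the diversity tip in particular, a quick count of metadata labels before each run makes coverage gaps easy to spot. The helper below is an illustrative sketch built on the metadata fields described earlier.

```python
# A quick coverage check: count entries per metadata label so gaps stand out
# before an evaluation run. Label names follow the metadata fields described
# earlier; the helper itself is illustrative.
from collections import Counter


def label_coverage(entries: list[dict], label: str) -> Counter:
    return Counter(entry["metadata"].get(label, "unlabeled") for entry in entries)


# e.g. label_coverage(entries, "category") or label_coverage(entries, "difficulty")
```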
Summary and Outlook #
Building a high-quality evaluation dataset is an indispensable part of AI system development. Through systematically collecting, classifying, and using evaluation data, we can more objectively measure system performance, identify potential issues, and guide subsequent optimization directions.
Our team is still exploring and experimenting in this area, and we will keep sharing lessons and insights as we go. If you have any questions or suggestions, please feel free to contact us!
This is the first article in our AI evaluation series. We will share more about evaluation metric selection, evaluation framework comparisons, and other content in future articles. Stay tuned!