Microsoft Introduces ASSERT: A Framework for Testing AI Behavior

Key Details

ASSERT converts plain-language descriptions into structured sets of acceptable and unacceptable behaviors.

Microsoft has released ASSERT (Adaptive Spec-driven Scoring for Evaluation and Regression Testing) , an open-source framework designed to evaluate application-specific AI behavior. The tool allows developers to use natural-language descriptions of expected behaviors and policies to generate structured test cases, run them against AI systems, and score the results.

The framework operates through several key functions:

Generates problem scenarios and test cases based on plain-language rules
Runs tests against the target system and scores the results
Records intermediate actions and tool calls for detailed failure inspection
Allows customization through system context, tools, and constraints

Example: A developer could specify rules for a document research AI agent, and ASSERT would automatically generate tests to check compliance with those rules.

Context

"Evaluations are critical for understanding whether AI systems meet organizational standards."

According to Microsoft's Sarah Bird, chief product officer of Responsible AI, application-specific evaluations are essential for building trustworthy systems. ASSERT can be deployed during development, after deployment, and for continuous monitoring.

The release aligns with broader industry efforts on repeatable testing and regression checks, including initiatives from:

Stanford's HELM
MLCommons' AILuminate
METR

Hey There!

Microsoft releases ASSERT framework for testing AI behavior

Microsoft Introduces ASSERT: A Framework for Testing AI Behavior

Key Details

Context