MLOS E2E Test Details - Golden Test Data

🔤 GPT2

NLP

Text generation model - validates output tensor exists

output_exists PASSED

Input Data

{
  "text": "Hello, I am a language model",
  "max_length": 16
}

Field	Expected	Actual	Result
Validation Type	output_exists	output_exists	INFO
Output Elements	>= 10,000	12,288	PASS
Inference Time	-	42.97 ms	INFO

Notes: DistilGPT2 returns flattened tensor of shape [seq_len * hidden_dim]

Source: HuggingFace Model Hub

🔤 BERT

NLP

Masked language model - validates inference produces output

status_success PASSED

Input Data

{
  "text": "The capital of France is [MASK].",
  "max_length": 16
}

Field	Expected	Actual	Result
Validation Type	status_success	status_success	INFO
Status	success	success	PASS
Output Size	>= 100,000 bytes	4,952,939 bytes	PASS
Inference Time	-	192.32 ms	INFO

Notes: BERT output_size validation (tensor data too large for curl transfer)

Source: HuggingFace Model Hub

🔤 ROBERTA

NLP

Robust BERT - validates inference produces output

status_success PASSED

Input Data

{
  "text": "RoBERTa is great at understanding context.",
  "max_length": 16
}

Field	Expected	Actual	Result
Validation Type	status_success	status_success	INFO
Status	success	success	PASS
Output Size	>= 100,000 bytes	7,933,700 bytes	PASS
Inference Time	-	169.59 ms	INFO

Notes: RoBERTa output_size validation (tensor data too large for curl transfer)

Source: HuggingFace Model Hub

🔤 T5

NLP

Seq2seq model - encoder-decoder architecture

encoder_decoder_response PASSED

Input Data

{
  "text": "translate English to German: Hello",
  "max_length": 16,
  "decoder_max_length": 8
}

Field	Expected	Actual	Result
Validation Type	status_success	status_success	INFO
Status	success	success	PASS
Inference Time	-	37.52 ms	INFO

Notes: T5 uses special encoder-decoder inference, validates status=success

Source: HuggingFace Model Hub

🔤 DISTILBERT

NLP

Distilled BERT - validates inference produces output

status_success PASSED

Input Data

{
  "text": "DistilBERT is smaller and faster.",
  "max_length": 12
}

Field	Expected	Actual	Result
Validation Type	status_success	status_success	INFO
Status	success	success	PASS
Output Size	>= 100,000 bytes	116,880 bytes	PASS
Inference Time	-	14.69 ms	INFO

Notes: DistilBERT output_size validation (tensor data too large for curl transfer)

Source: HuggingFace Model Hub

🔤 ALBERT

NLP

ALBERT - validates inference produces output

status_success PASSED

Input Data

{
  "text": "ALBERT uses parameter sharing.",
  "max_length": 10
}

Field	Expected	Actual	Result
Validation Type	status_success	status_success	INFO
Status	success	success	PASS
Output Size	>= 100,000 bytes	4,805,253 bytes	PASS
Inference Time	-	103.29 ms	INFO

Notes: ALBERT output_size validation (tensor data too large for curl transfer)

Source: HuggingFace Model Hub

🔤 SENTENCE-TRANSFORMERS

NLP

Sentence embedding model - validates embedding output

status_success PASSED

Input Data

{
  "text": "This is a test sentence for embedding.",
  "max_length": 128
}

Field	Expected	Actual	Result
Validation Type	status_success	status_success	INFO
Status	success	success	PASS
Output Size	>= 100,000 bytes	469,489 bytes	PASS
Inference Time	-	37.09 ms	INFO

Notes: Sentence transformer returns hidden states (large output)

Source: HuggingFace Model Hub

👁️ RESNET

VISION

ResNet-50 ImageNet classifier - validates classification output

output_shape_validation PASSED

Input Data

{
  "image_size": 224,
  "channels": 3,
  "seed": 42
}

Field	Expected	Actual	Result
Validation Type	output_shape	output_shape	INFO
Output Shape	[1000]	[1000]	PASS
Inference Time	-	72.05 ms	INFO

Source: ONNX Model Zoo / ImageNet

cat_classification FAILED

Input Data

{
  "golden_image": "test-data/golden-images/imagenet/cat_tabby.jpg"
}

Field	Expected	Actual	Result
Validation Type	top_k_class_match	top_k_class_match	INFO
Expected Class	tabby cat (class 281) or [282, 283, 284, 285]	-	INFO
Top-K Threshold	5	-	INFO
Top-5 Predictions	-	632(-0.749), 409(-1.264), 818(-2.071), 507(-2.405), 567(-2.755)	INFO
Classification Result	Class 281 in top-5	Class 281 not in top-5	FAIL
Inference Time	-	72.05 ms	INFO

Notes: Validates model correctly classifies tabby cat image

Source: ONNX Model Zoo / ImageNet

dog_classification FAILED

Input Data

{
  "golden_image": "test-data/golden-images/imagenet/dog_golden_retriever.jpg"
}

Field	Expected	Actual	Result
Validation Type	top_k_class_match	top_k_class_match	INFO
Expected Class	golden retriever (class 207) or [206, 208, 209]	-	INFO
Top-K Threshold	5	-	INFO
Top-5 Predictions	-	632(-0.749), 409(-1.264), 818(-2.071), 507(-2.405), 567(-2.755)	INFO
Classification Result	Class 207 in top-5	Class 207 not in top-5	FAIL
Inference Time	-	72.05 ms	INFO

Notes: Validates model correctly classifies golden retriever image

Source: ONNX Model Zoo / ImageNet

👁️ VIT

VISION

Vision Transformer - validates transformer-based classification

output_shape_validation PASSED

Input Data

{
  "image_size": 224,
  "channels": 3,
  "seed": 42
}

Field	Expected	Actual	Result
Validation Type	output_shape	output_shape	INFO
Output Shape	[1000]	[1000]	PASS
Inference Time	-	269.18 ms	INFO

Source: ONNX Model Zoo / ImageNet

cat_classification FAILED

Input Data

{
  "golden_image": "test-data/golden-images/imagenet/cat_tabby.jpg"
}

Field	Expected	Actual	Result
Validation Type	top_k_class_match	top_k_class_match	INFO
Expected Class	tabby cat (class 281) or [282, 283, 284, 285]	-	INFO
Top-K Threshold	5	-	INFO
Top-5 Predictions	-	868(5.282), 646(4.419), 599(4.118), 611(4.040), 506(3.681)	INFO
Classification Result	Class 281 in top-5	Class 281 not in top-5	FAIL
Inference Time	-	269.18 ms	INFO

Notes: Validates ViT correctly classifies tabby cat image

Source: ONNX Model Zoo / ImageNet

coffee_mug_classification FAILED

Input Data

{
  "golden_image": "test-data/golden-images/imagenet/coffee_mug.jpg"
}

Field	Expected	Actual	Result
Validation Type	top_k_class_match	top_k_class_match	INFO
Expected Class	coffee mug (class 504) or [968]	-	INFO
Top-K Threshold	5	-	INFO
Top-5 Predictions	-	868(5.282), 646(4.419), 599(4.118), 611(4.040), 506(3.681)	INFO
Classification Result	Class 504 in top-5	Class 504 not in top-5	FAIL
Inference Time	-	269.18 ms	INFO

Notes: Validates ViT correctly classifies coffee mug image

Source: ONNX Model Zoo / ImageNet

👁️ CONVNEXT

VISION

ConvNeXt - modern CNN architecture

output_shape_validation PASSED

Input Data

{
  "image_size": 224,
  "channels": 3,
  "seed": 42
}

Field	Expected	Actual	Result
Validation Type	output_shape	output_shape	INFO
Output Shape	[1000]	[1000]	PASS
Inference Time	-	96.93 ms	INFO

Source: HuggingFace Model Hub

👁️ MOBILENET

VISION

MobileNetV2 - efficient mobile classifier

output_shape_validation PASSED

Input Data

{
  "image_size": 224,
  "channels": 3,
  "seed": 42
}

Field	Expected	Actual	Result
Validation Type	output_shape	output_shape	INFO
Output Shape	[1001]	[1001]	PASS
Inference Time	-	34.66 ms	INFO

Source: ONNX Model Zoo / ImageNet

sports_car_classification FAILED

Input Data

{
  "golden_image": "test-data/golden-images/imagenet/sports_car.jpg"
}

Field	Expected	Actual	Result
Validation Type	top_k_class_match	top_k_class_match	INFO
Expected Class	sports car (class 817) or [511, 609, 627, 656, 717, 751, 864]	-	INFO
Top-K Threshold	5	-	INFO
Top-5 Predictions	-	972(8.685), 712(6.836), 645(6.623), 620(6.241), 563(6.176)	INFO
Classification Result	Class 817 in top-5	Class 817 not in top-5	FAIL
Inference Time	-	34.66 ms	INFO

Notes: Validates MobileNet correctly classifies sports car image

Source: ONNX Model Zoo / ImageNet

clock_classification FAILED

Input Data

{
  "golden_image": "test-data/golden-images/imagenet/clock_analog.jpg"
}

Field	Expected	Actual	Result
Validation Type	top_k_class_match	top_k_class_match	INFO
Expected Class	analog clock (class 409) or [530, 892]	-	INFO
Top-K Threshold	5	-	INFO
Top-5 Predictions	-	972(8.685), 712(6.836), 645(6.623), 620(6.241), 563(6.176)	INFO
Classification Result	Class 409 in top-5	Class 409 not in top-5	FAIL
Inference Time	-	34.66 ms	INFO

Notes: Validates MobileNet correctly classifies analog clock image

Source: ONNX Model Zoo / ImageNet

👁️ DEIT

VISION

DeiT - data-efficient ViT

output_shape_validation PASSED

Input Data

{
  "image_size": 224,
  "channels": 3,
  "seed": 42
}

Field	Expected	Actual	Result
Validation Type	output_shape	output_shape	INFO
Output Shape	[1000]	[1000]	PASS
Inference Time	-	85.85 ms	INFO

Source: HuggingFace Model Hub

👁️ EFFICIENTNET

VISION

EfficientNet-B0 - compound scaled CNN

output_shape_validation PASSED

Input Data

{
  "image_size": 224,
  "channels": 3,
  "seed": 42
}

Field	Expected	Actual	Result
Validation Type	output_shape	output_shape	INFO
Output Shape	[1000]	[1000]	PASS
Inference Time	-	35.01 ms	INFO

Source: HuggingFace Model Hub

👁️ CLIP

VISION

CLIP - image-text similarity model

status_success PASSED

Input Data

{
  "text": "a photo of a cat",
  "text_max_length": 77,
  "image_size": 224,
  "channels": 3,
  "seed": 42
}

Field	Expected	Actual	Result
Validation Type	status_success	status_success	INFO
Status	success	success	PASS
Inference Time	-	165.53 ms	INFO

Notes: CLIP returns text-image similarity score

Source: HuggingFace Model Hub

🤖 TINYLLAMA

LLM

TinyLlama 1.1B GGUF - validates text generation

small PASSED

Input Data

{
  "prompt": "What is the capital of France?",
  "max_tokens": 32,
  "temperature": 0.1
}

Field	Expected	Actual	Result
Validation Type	generation_contains	generation_contains	INFO
Expected Keywords	['Paris']	Found: ['Paris']	PASS
Generated Text	(any containing keywords)	"{'generated_text': '\nYes, the capital of France is Paris.', 'tokens_generated': 10}"	MATCH
Inference Time	-	453.90 ms	INFO

Source: HuggingFace Model Hub

large PASSED

Input Data

{
  "prompt": "Write a detailed explanation of the theory of relativity and its implications for modern physics.",
  "max_tokens": 256,
  "temperature": 0.1
}

Field	Expected	Actual	Result
Validation Type	generation_contains	generation_contains	INFO
Expected Keywords	['Einstein', 'relativity', 'physics', 'time', 'space']	Found: ['Einstein', 'relativity', 'physics', 'time', 'space']	PASS
Generated Text	(any containing keywords)	"{'generated_text': '\n\nRelativity is a theory that describes the behavior of matter and energy in space and time. It is based on the principle of relativity, which states that the laws of physics are..."	MATCH
Inference Time	-	6574.45 ms	INFO

Source: HuggingFace Model Hub

🤖 QWEN2-0.5B

LLM

Qwen2 0.5B GGUF - validates instruction following

small PASSED

Input Data

{
  "prompt": "What is 2 + 2? Answer with just the number.",
  "max_tokens": 8,
  "temperature": 0.1
}

Field	Expected	Actual	Result
Validation Type	generation_contains	generation_contains	INFO
Expected Keywords	['4']	Found: ['4']	PASS
Generated Text	(any containing keywords)	"{'generated_text': '4', 'tokens_generated': 1}"	MATCH
Inference Time	-	173.20 ms	INFO

Notes: Simple arithmetic - answer must contain '4'

Source: HuggingFace Model Hub

large PASSED

Input Data

{
  "prompt": "Summarize the key developments in artificial intelligence over the past decade.",
  "max_tokens": 256,
  "temperature": 0.1
}

Field	Expected	Actual	Result
Validation Type	generation_contains	generation_contains	INFO
Expected Keywords	['AI', 'learning', 'neural', 'model']	Found: ['learning', 'neural', 'model']	PASS
Generated Text	(any containing keywords)	"{'generated_text': 'The key developments in artificial intelligence over the past decade include:\n\n1. Deep Learning: Deep learning is a type of artificial intelligence that uses neural networks to l..."	MATCH
Inference Time	-	4942.51 ms	INFO

Source: HuggingFace Model Hub

🤖 LLAMA-3.2-1B

LLM

Llama 3.2 1B GGUF - validates instruction following

small PASSED

Input Data

{
  "prompt": "What is the capital of Japan? Answer in one word.",
  "max_tokens": 16,
  "temperature": 0.1
}

Field	Expected	Actual	Result
Validation Type	generation_contains	generation_contains	INFO
Expected Keywords	['Tokyo']	Found: ['Tokyo']	PASS
Generated Text	(any containing keywords)	"{'generated_text': 'Tokyo.', 'tokens_generated': 3}"	MATCH
Inference Time	-	406.84 ms	INFO

Notes: Geography knowledge - answer must contain 'Tokyo'

Source: HuggingFace Model Hub

large PASSED

Input Data

{
  "prompt": "Explain the principles of machine learning in simple terms.",
  "max_tokens": 256,
  "temperature": 0.1
}

Field	Expected	Actual	Result
Validation Type	generation_contains	generation_contains	INFO
Expected Keywords	['data', 'learn', 'train', 'model', 'algorithm']	Found: ['data', 'learn', 'train', 'model']	PASS
Generated Text	(any containing keywords)	"{'generated_text': "Machine learning is a way for computers to learn from data and make predictions or decisions on their own. Here are the simple principles of machine learning:\n\n**1. Data Collecti..."	MATCH
Inference Time	-	8173.24 ms	INFO

Source: HuggingFace Model Hub

🤖 DEEPSEEK-CODER-1.3B

LLM

DeepSeek Coder 1.3B GGUF - validates code generation

small PASSED

Input Data

{
  "prompt": "Write a Python function called 'add' that takes two numbers and returns their sum.",
  "max_tokens": 64,
  "temperature": 0.1
}

Field	Expected	Actual	Result
Validation Type	generation_contains	generation_contains	INFO
Expected Keywords	['def add', 'return']	Found: ['def add', 'return']	PASS
Generated Text	(any containing keywords)	"{'generated_text': 'def add(num1, num2):\n return num1 + num2\n\n<\|assistant\|>\nprint(add(5, 3))\n\n<\|assistant\|>\nprint(add(10, 20))\n\n<\|assistant\|>\n', 'tokens_generated': 64}"	MATCH
Inference Time	-	2650.28 ms	INFO

Notes: Simple function - answer must contain 'def add' and 'return'

Source: HuggingFace Model Hub

large PASSED

Input Data

{
  "prompt": "Write a Python function to implement binary search on a sorted list.",
  "max_tokens": 256,
  "temperature": 0.1
}

Field	Expected	Actual	Result
Validation Type	generation_contains	generation_contains	INFO
Expected Keywords	['def', 'binary', 'return']	Found: ['def', 'binary', 'return']	PASS
Generated Text	(any containing keywords)	"{'generated_text': 'Sure, here is a Python function that implements binary search on a sorted list:\n\n```python\ndef binary_search(arr, low, high, x):\n \n if high >= low:\n \n mid = (high ..."	MATCH
Inference Time	-	9015.38 ms	INFO

Source: HuggingFace Model Hub

📚 Data Sources

HuggingFace Model Hub

ONNX Model Zoo

ImageNet Labels

🔤 GPT2

Input Data

🔤 BERT

Input Data

🔤 ROBERTA

Input Data

🔤 T5

Input Data

🔤 DISTILBERT

Input Data

🔤 ALBERT

Input Data

🔤 SENTENCE-TRANSFORMERS

Input Data

👁️ RESNET

Input Data

Input Data

Input Data

👁️ VIT

Input Data

Input Data

Input Data

👁️ CONVNEXT

Input Data

👁️ MOBILENET

Input Data

Input Data

Input Data

👁️ DEIT

Input Data

👁️ EFFICIENTNET

Input Data

👁️ CLIP

Input Data

🤖 TINYLLAMA

Input Data

Input Data

🤖 QWEN2-0.5B

Input Data

Input Data

🤖 LLAMA-3.2-1B

Input Data

Input Data

🤖 DEEPSEEK-CODER-1.3B

Input Data

Input Data