🧪 Test Details - Golden Test Data

Complete transparency on test inputs, expected outputs, and validation results

Tests Passed: 22
Tests Failed: 6
Models Tested: 18
Test Cases: 28

📚 Data Sources

HuggingFace Model Hub

Model documentation, tokenizer configs, and example inputs from official model cards.

huggingface.co/models

ONNX Model Zoo

Validated ONNX models with test data and expected outputs for vision models.

github.com/onnx/models

ImageNet Labels

Standard ImageNet-1K class labels for vision model validation.

ImageNet synset.txt

🔤 GPT2

NLP

Text generation model - validates output tensor exists

output_exists PASSED
Input Data
{ "text": "Hello, I am a language model", "max_length": 16 }
Field | Expected | Actual | Result
Validation Type | output_exists | output_exists | INFO
Output Elements | >= 10,000 | 12,288 | PASS
Inference Time | - | 42.97 ms | INFO
Notes: DistilGPT2 returns a flattened tensor of shape [seq_len * hidden_dim]; 12,288 elements is consistent with 16 tokens x 768 hidden units

Source: HuggingFace Model Hub
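The output_exists check in the GPT2 test above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the function name `validate_output_exists` and the flat-list representation of the tensor are assumptions, and the hidden size of 768 is inferred from 12,288 / 16.

```python
def validate_output_exists(output, min_elements):
    """Return (passed, element_count) for a flattened output tensor."""
    count = len(output)
    return count >= min_elements, count


# Hypothetical stand-in for the model output: 16 positions x 768 hidden units,
# flattened into a single list as described in the test notes.
seq_len, hidden_dim = 16, 768
flat_output = [0.0] * (seq_len * hidden_dim)

passed, count = validate_output_exists(flat_output, min_elements=10_000)
print(passed, count)  # True 12288
```

A check like this only confirms that the model produced output of a plausible size; it says nothing about whether the values are correct.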

🔤 BERT

NLP

Masked language model - validates inference produces output

status_success PASSED
Input Data
{ "text": "The capital of France is [MASK].", "max_length": 16 }
Field | Expected | Actual | Result
Validation Type | status_success | status_success | INFO
Status | success | success | PASS
Output Size | >= 100,000 bytes | 4,952,939 bytes | PASS
Inference Time | - | 192.32 ms | INFO
Notes: BERT output_size validation (tensor data too large for curl transfer)

Source: HuggingFace Model Hub
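A status_success check with an output-size floor, as used for BERT above, might look like the sketch below. The response field names `status` and `output_size_bytes` and the function name are hypothetical; the report does not show the actual response schema.

```python
def validate_status_success(response, min_output_bytes=None):
    """Check the status field and, optionally, a minimum output size in bytes."""
    checks = {"status": response.get("status") == "success"}
    if min_output_bytes is not None:
        size = response.get("output_size_bytes", 0)
        checks["output_size"] = size >= min_output_bytes
    return all(checks.values()), checks


# Values taken from the BERT row above; the dict layout is an assumption.
bert_response = {"status": "success", "output_size_bytes": 4_952_939}
ok, checks = validate_status_success(bert_response, min_output_bytes=100_000)
print(ok, checks)  # True {'status': True, 'output_size': True}
```

Comparing sizes rather than tensor contents sidesteps transferring the full tensor, which matches the note about the data being too large for curl.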

🔤 ROBERTA

NLP

Robust BERT - validates inference produces output

status_success PASSED
Input Data
{ "text": "RoBERTa is great at understanding context.", "max_length": 16 }
Field | Expected | Actual | Result
Validation Type | status_success | status_success | INFO
Status | success | success | PASS
Output Size | >= 100,000 bytes | 7,933,700 bytes | PASS
Inference Time | - | 169.59 ms | INFO
Notes: RoBERTa output_size validation (tensor data too large for curl transfer)

Source: HuggingFace Model Hub

🔤 T5

NLP

Seq2seq model - encoder-decoder architecture

encoder_decoder_response PASSED
Input Data
{ "text": "translate English to German: Hello", "max_length": 16, "decoder_max_length": 8 }
Field | Expected | Actual | Result
Validation Type | status_success | status_success | INFO
Status | success | success | PASS
Inference Time | - | 37.52 ms | INFO
Notes: T5 uses special encoder-decoder inference, validates status=success

Source: HuggingFace Model Hub

🔤 DISTILBERT

NLP

Distilled BERT - validates inference produces output

status_success PASSED
Input Data
{ "text": "DistilBERT is smaller and faster.", "max_length": 12 }
Field | Expected | Actual | Result
Validation Type | status_success | status_success | INFO
Status | success | success | PASS
Output Size | >= 100,000 bytes | 116,880 bytes | PASS
Inference Time | - | 14.69 ms | INFO
Notes: DistilBERT output_size validation (tensor data too large for curl transfer)

Source: HuggingFace Model Hub

🔤 ALBERT

NLP

ALBERT - validates inference produces output

status_success PASSED
Input Data
{ "text": "ALBERT uses parameter sharing.", "max_length": 10 }
Field | Expected | Actual | Result
Validation Type | status_success | status_success | INFO
Status | success | success | PASS
Output Size | >= 100,000 bytes | 4,805,253 bytes | PASS
Inference Time | - | 103.29 ms | INFO
Notes: ALBERT output_size validation (tensor data too large for curl transfer)

Source: HuggingFace Model Hub

🔤 SENTENCE-TRANSFORMERS

NLP

Sentence embedding model - validates embedding output

status_success PASSED
Input Data
{ "text": "This is a test sentence for embedding.", "max_length": 128 }
Field | Expected | Actual | Result
Validation Type | status_success | status_success | INFO
Status | success | success | PASS
Output Size | >= 100,000 bytes | 469,489 bytes | PASS
Inference Time | - | 37.09 ms | INFO
Notes: Sentence transformer returns hidden states (large output)

Source: HuggingFace Model Hub

👁️ RESNET

VISION

ResNet-50 ImageNet classifier - validates classification output

output_shape_validation PASSED
Input Data
{ "image_size": 224, "channels": 3, "seed": 42 }
Field | Expected | Actual | Result
Validation Type | output_shape | output_shape | INFO
Output Shape | [1000] | [1000] | PASS
Inference Time | - | 72.05 ms | INFO

Source: ONNX Model Zoo / ImageNet
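The shape test above takes an image size, a channel count, and a seed, which suggests a reproducible synthetic input. The sketch below shows one way such a test could be built; `make_seeded_input`, `validate_output_shape`, and `stub_classifier` are hypothetical names, and the stub stands in for real model inference.

```python
import random


def make_seeded_input(image_size, channels, seed):
    """Build a reproducible flat list of pseudo-random pixel values in [0, 1)."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(channels * image_size * image_size)]


def validate_output_shape(output, expected_shape):
    """Compare the length of a 1-D output against the expected shape."""
    return [len(output)] == list(expected_shape)


def stub_classifier(pixels, num_classes=1000):
    # Placeholder for real inference: returns one logit per class.
    return [0.0] * num_classes


pixels = make_seeded_input(image_size=224, channels=3, seed=42)
logits = stub_classifier(pixels)
shape_ok = validate_output_shape(logits, [1000])

# The same seed yields the same input, so repeated runs are comparable.
same_input = pixels == make_seeded_input(224, 3, 42)
print(len(pixels), shape_ok, same_input)  # 150528 True True
```

A shape check passes for any model that emits the right number of logits, so it complements the top-k class tests rather than replacing them.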

cat_classification FAILED
Input Data
{ "golden_image": "test-data/golden-images/imagenet/cat_tabby.jpg" }
Field | Expected | Actual | Result
Validation Type | top_k_class_match | top_k_class_match | INFO
Expected Class | tabby cat (class 281) or [282, 283, 284, 285] | - | INFO
Top-K Threshold | 5 | - | INFO
Top-5 Predictions | - | 632(-0.749), 409(-1.264), 818(-2.071), 507(-2.405), 567(-2.755) | INFO
Classification Result | Class 281 in top-5 | Class 281 not in top-5 | FAIL
Inference Time | - | 72.05 ms | INFO
Notes: Validates model correctly classifies tabby cat image

Source: ONNX Model Zoo / ImageNet
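A top_k_class_match check like the one above could be sketched as follows. This assumes the alternate class IDs listed after "or" are also accepted, which the report implies but does not state; the function names are hypothetical, and the logits are reconstructed from the five reported scores with every other class set to a low filler value.

```python
def top_k_indices(logits, k):
    """Return the indices of the k largest logits, highest first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def validate_top_k(logits, expected_class, alternates=(), k=5):
    """Pass if the expected class or any accepted alternate is in the top k."""
    top_k = top_k_indices(logits, k)
    accepted = {expected_class, *alternates}
    return any(i in accepted for i in top_k), top_k


# Rebuild a 1000-class logit vector from the reported top-5 scores.
logits = [-10.0] * 1000
reported = {632: -0.749, 409: -1.264, 818: -2.071, 507: -2.405, 567: -2.755}
for class_id, score in reported.items():
    logits[class_id] = score

matched, top_k = validate_top_k(logits, 281, alternates=[282, 283, 284, 285])
print(matched, top_k)  # False [632, 409, 818, 507, 567]
```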

dog_classification FAILED
Input Data
{ "golden_image": "test-data/golden-images/imagenet/dog_golden_retriever.jpg" }
Field | Expected | Actual | Result
Validation Type | top_k_class_match | top_k_class_match | INFO
Expected Class | golden retriever (class 207) or [206, 208, 209] | - | INFO
Top-K Threshold | 5 | - | INFO
Top-5 Predictions | - | 632(-0.749), 409(-1.264), 818(-2.071), 507(-2.405), 567(-2.755) | INFO
Classification Result | Class 207 in top-5 | Class 207 not in top-5 | FAIL
Inference Time | - | 72.05 ms | INFO
Notes: Validates model correctly classifies golden retriever image

Source: ONNX Model Zoo / ImageNet

👁️ VIT

VISION

Vision Transformer - validates transformer-based classification

output_shape_validation PASSED
Input Data
{ "image_size": 224, "channels": 3, "seed": 42 }
Field | Expected | Actual | Result
Validation Type | output_shape | output_shape | INFO
Output Shape | [1000] | [1000] | PASS
Inference Time | - | 269.18 ms | INFO

Source: ONNX Model Zoo / ImageNet

cat_classification FAILED
Input Data
{ "golden_image": "test-data/golden-images/imagenet/cat_tabby.jpg" }
Field | Expected | Actual | Result
Validation Type | top_k_class_match | top_k_class_match | INFO
Expected Class | tabby cat (class 281) or [282, 283, 284, 285] | - | INFO
Top-K Threshold | 5 | - | INFO
Top-5 Predictions | - | 868(5.282), 646(4.419), 599(4.118), 611(4.040), 506(3.681) | INFO
Classification Result | Class 281 in top-5 | Class 281 not in top-5 | FAIL
Inference Time | - | 269.18 ms | INFO
Notes: Validates ViT correctly classifies tabby cat image

Source: ONNX Model Zoo / ImageNet

coffee_mug_classification FAILED
Input Data
{ "golden_image": "test-data/golden-images/imagenet/coffee_mug.jpg" }
Field | Expected | Actual | Result
Validation Type | top_k_class_match | top_k_class_match | INFO
Expected Class | coffee mug (class 504) or [968] | - | INFO
Top-K Threshold | 5 | - | INFO
Top-5 Predictions | - | 868(5.282), 646(4.419), 599(4.118), 611(4.040), 506(3.681) | INFO
Classification Result | Class 504 in top-5 | Class 504 not in top-5 | FAIL
Inference Time | - | 269.18 ms | INFO
Notes: Validates ViT correctly classifies coffee mug image

Source: ONNX Model Zoo / ImageNet

👁️ CONVNEXT

VISION

ConvNeXt - modern CNN architecture

output_shape_validation PASSED
Input Data
{ "image_size": 224, "channels": 3, "seed": 42 }
Field | Expected | Actual | Result
Validation Type | output_shape | output_shape | INFO
Output Shape | [1000] | [1000] | PASS
Inference Time | - | 96.93 ms | INFO

Source: HuggingFace Model Hub

👁️ MOBILENET

VISION

MobileNetV2 - efficient mobile classifier

output_shape_validation PASSED
Input Data
{ "image_size": 224, "channels": 3, "seed": 42 }
Field | Expected | Actual | Result
Validation Type | output_shape | output_shape | INFO
Output Shape | [1001] | [1001] | PASS
Inference Time | - | 34.66 ms | INFO

Source: ONNX Model Zoo / ImageNet

sports_car_classification FAILED
Input Data
{ "golden_image": "test-data/golden-images/imagenet/sports_car.jpg" }
Field | Expected | Actual | Result
Validation Type | top_k_class_match | top_k_class_match | INFO
Expected Class | sports car (class 817) or [511, 609, 627, 656, 717, 751, 864] | - | INFO
Top-K Threshold | 5 | - | INFO
Top-5 Predictions | - | 972(8.685), 712(6.836), 645(6.623), 620(6.241), 563(6.176) | INFO
Classification Result | Class 817 in top-5 | Class 817 not in top-5 | FAIL
Inference Time | - | 34.66 ms | INFO
Notes: Validates MobileNet correctly classifies sports car image

Source: ONNX Model Zoo / ImageNet

clock_classification FAILED
Input Data
{ "golden_image": "test-data/golden-images/imagenet/clock_analog.jpg" }
Field | Expected | Actual | Result
Validation Type | top_k_class_match | top_k_class_match | INFO
Expected Class | analog clock (class 409) or [530, 892] | - | INFO
Top-K Threshold | 5 | - | INFO
Top-5 Predictions | - | 972(8.685), 712(6.836), 645(6.623), 620(6.241), 563(6.176) | INFO
Classification Result | Class 409 in top-5 | Class 409 not in top-5 | FAIL
Inference Time | - | 34.66 ms | INFO
Notes: Validates MobileNet correctly classifies analog clock image

Source: ONNX Model Zoo / ImageNet

👁️ DEIT

VISION

DeiT - data-efficient ViT

output_shape_validation PASSED
Input Data
{ "image_size": 224, "channels": 3, "seed": 42 }
Field | Expected | Actual | Result
Validation Type | output_shape | output_shape | INFO
Output Shape | [1000] | [1000] | PASS
Inference Time | - | 85.85 ms | INFO

Source: HuggingFace Model Hub

👁️ EFFICIENTNET

VISION

EfficientNet-B0 - compound scaled CNN

output_shape_validation PASSED
Input Data
{ "image_size": 224, "channels": 3, "seed": 42 }
Field | Expected | Actual | Result
Validation Type | output_shape | output_shape | INFO
Output Shape | [1000] | [1000] | PASS
Inference Time | - | 35.01 ms | INFO

Source: HuggingFace Model Hub

👁️ CLIP

VISION

CLIP - image-text similarity model

status_success PASSED
Input Data
{ "text": "a photo of a cat", "text_max_length": 77, "image_size": 224, "channels": 3, "seed": 42 }
Field | Expected | Actual | Result
Validation Type | status_success | status_success | INFO
Status | success | success | PASS
Inference Time | - | 165.53 ms | INFO
Notes: CLIP returns text-image similarity score

Source: HuggingFace Model Hub
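The CLIP test above only validates the status, but the similarity score mentioned in its notes is typically derived from the cosine similarity between a text embedding and an image embedding. The sketch below illustrates that computation with tiny made-up vectors; the embeddings and the function name are hypothetical and do not come from the report.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional embeddings; real CLIP embeddings are much larger.
text_embedding = [1.0, 0.0, 1.0]
image_embedding = [1.0, 1.0, 0.0]

score = cosine_similarity(text_embedding, image_embedding)
print(round(score, 3))  # 0.5
```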

🤖 TINYLLAMA

LLM

TinyLlama 1.1B GGUF - validates text generation

small PASSED
Input Data
{ "prompt": "What is the capital of France?", "max_tokens": 32, "temperature": 0.1 }
Field | Expected | Actual | Result
Validation Type | generation_contains | generation_contains | INFO
Expected Keywords | ['Paris'] | Found: ['Paris'] | PASS
Generated Text | (any containing keywords) | "{'generated_text': '\nYes, the capital of France is Paris.', 'tokens_generated': 10}" | MATCH
Inference Time | - | 453.90 ms | INFO

Source: HuggingFace Model Hub
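The generation_contains check used for the LLM tests can be approximated with a simple substring search. The sketch below assumes case-sensitive matching and that finding any one keyword is enough to pass, which is how the "(any containing keywords)" label reads; both assumptions, and the function names, are illustrative rather than taken from the harness.

```python
def find_keywords(generated_text, keywords):
    """Return the keywords that occur verbatim (case-sensitive) in the text."""
    return [kw for kw in keywords if kw in generated_text]


def validate_generation_contains(generated_text, keywords, require_all=False):
    """Pass on any keyword match by default, or on all matches if require_all."""
    found = find_keywords(generated_text, keywords)
    passed = len(found) == len(keywords) if require_all else bool(found)
    return passed, found


# Text and keywords from the TinyLlama 'small' test above.
print(validate_generation_contains("\nYes, the capital of France is Paris.", ["Paris"]))
# (True, ['Paris'])

# A partial match still passes under the any-match assumption.
excerpt = "Deep learning is a type of artificial intelligence that uses neural networks"
keywords = ["AI", "learning", "neural", "model"]
partial = validate_generation_contains(excerpt, keywords)
print(partial)  # (True, ['learning', 'neural'])
```

Passing `require_all=True` turns the same helper into a stricter check for cases where every keyword must appear.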

large PASSED
Input Data
{ "prompt": "Write a detailed explanation of the theory of relativity and its implications for modern physics.", "max_tokens": 256, "temperature": 0.1 }
Field | Expected | Actual | Result
Validation Type | generation_contains | generation_contains | INFO
Expected Keywords | ['Einstein', 'relativity', 'physics', 'time', 'space'] | Found: ['Einstein', 'relativity', 'physics', 'time', 'space'] | PASS
Generated Text | (any containing keywords) | "{'generated_text': '\n\nRelativity is a theory that describes the behavior of matter and energy in space and time. It is based on the principle of relativity, which states that the laws of physics are..." | MATCH
Inference Time | - | 6574.45 ms | INFO

Source: HuggingFace Model Hub

🤖 QWEN2-0.5B

LLM

Qwen2 0.5B GGUF - validates instruction following

small PASSED
Input Data
{ "prompt": "What is 2 + 2? Answer with just the number.", "max_tokens": 8, "temperature": 0.1 }
Field | Expected | Actual | Result
Validation Type | generation_contains | generation_contains | INFO
Expected Keywords | ['4'] | Found: ['4'] | PASS
Generated Text | (any containing keywords) | "{'generated_text': '4', 'tokens_generated': 1}" | MATCH
Inference Time | - | 173.20 ms | INFO
Notes: Simple arithmetic - answer must contain '4'

Source: HuggingFace Model Hub

large PASSED
Input Data
{ "prompt": "Summarize the key developments in artificial intelligence over the past decade.", "max_tokens": 256, "temperature": 0.1 }
Field | Expected | Actual | Result
Validation Type | generation_contains | generation_contains | INFO
Expected Keywords | ['AI', 'learning', 'neural', 'model'] | Found: ['learning', 'neural', 'model'] | PASS
Generated Text | (any containing keywords) | "{'generated_text': 'The key developments in artificial intelligence over the past decade include:\n\n1. Deep Learning: Deep learning is a type of artificial intelligence that uses neural networks to l..." | MATCH
Inference Time | - | 4942.51 ms | INFO

Source: HuggingFace Model Hub

🤖 LLAMA-3.2-1B

LLM

Llama 3.2 1B GGUF - validates instruction following

small PASSED
Input Data
{ "prompt": "What is the capital of Japan? Answer in one word.", "max_tokens": 16, "temperature": 0.1 }
Field | Expected | Actual | Result
Validation Type | generation_contains | generation_contains | INFO
Expected Keywords | ['Tokyo'] | Found: ['Tokyo'] | PASS
Generated Text | (any containing keywords) | "{'generated_text': 'Tokyo.', 'tokens_generated': 3}" | MATCH
Inference Time | - | 406.84 ms | INFO
Notes: Geography knowledge - answer must contain 'Tokyo'

Source: HuggingFace Model Hub

large PASSED
Input Data
{ "prompt": "Explain the principles of machine learning in simple terms.", "max_tokens": 256, "temperature": 0.1 }
Field | Expected | Actual | Result
Validation Type | generation_contains | generation_contains | INFO
Expected Keywords | ['data', 'learn', 'train', 'model', 'algorithm'] | Found: ['data', 'learn', 'train', 'model'] | PASS
Generated Text | (any containing keywords) | "{'generated_text': "Machine learning is a way for computers to learn from data and make predictions or decisions on their own. Here are the simple principles of machine learning:\n\n**1. Data Collecti..." | MATCH
Inference Time | - | 8173.24 ms | INFO

Source: HuggingFace Model Hub

🤖 DEEPSEEK-CODER-1.3B

LLM

DeepSeek Coder 1.3B GGUF - validates code generation

small PASSED
Input Data
{ "prompt": "Write a Python function called 'add' that takes two numbers and returns their sum.", "max_tokens": 64, "temperature": 0.1 }
Field | Expected | Actual | Result
Validation Type | generation_contains | generation_contains | INFO
Expected Keywords | ['def add', 'return'] | Found: ['def add', 'return'] | PASS
Generated Text | (any containing keywords) | "{'generated_text': 'def add(num1, num2):\n return num1 + num2\n\n<|assistant|>\nprint(add(5, 3))\n\n<|assistant|>\nprint(add(10, 20))\n\n<|assistant|>\n', 'tokens_generated': 64}" | MATCH
Inference Time | - | 2650.28 ms | INFO
Notes: Simple function - answer must contain 'def add' and 'return'

Source: HuggingFace Model Hub

large PASSED
Input Data
{ "prompt": "Write a Python function to implement binary search on a sorted list.", "max_tokens": 256, "temperature": 0.1 }
Field | Expected | Actual | Result
Validation Type | generation_contains | generation_contains | INFO
Expected Keywords | ['def', 'binary', 'return'] | Found: ['def', 'binary', 'return'] | PASS
Generated Text | (any containing keywords) | "{'generated_text': 'Sure, here is a Python function that implements binary search on a sorted list:\n\n```python\ndef binary_search(arr, low, high, x):\n \n if high >= low:\n \n mid = (high ..." | MATCH
Inference Time | - | 9015.38 ms | INFO

Source: HuggingFace Model Hub