LLM Expect

View on GitHub

A Lightweight, Developer-First Way to Test LLM-Powered Functions

Over the past year, I’ve watched the LLM evaluation ecosystem get more complicated, not less. Teams are now building agent workflows, reasoning chains, translation systems, summarizers, RAG pipelines, and production LLM features. Yet one foundational problem remains surprisingly unsolved:

  • How do we reliably test LLM-powered functions?
  • How do we verify expected behavior before shipping?
  • How do we measure quality without rebuilding evaluation tooling every time?

In traditional software, unit tests are a given.
With LLMs, testing is still early and fragmented. Most tools today are heavy, configuration-driven, dashboard-centric, or tied to proprietary infrastructure.

I built LLM-Expect as a tiny, minimalist, developer-first Python SDK that makes LLM evaluation feel as natural as writing a function.

The belief behind it is simple:

Evaluation should live in your codebase, not in a separate platform.


Why I Built LLM-Expect

Across conversational AI, applied LLM systems, innovation platforms, and agent workflows, I kept seeing the same pattern:

  • Teams writing bespoke evaluation harnesses
  • One-off notebooks
  • JSON scripts hacked together for each project
  • Common metrics re-implemented over and over
  • Heavy MLOps setups used for simple function checks

Current LLM QA often looks like:

  • Manual checking
  • Notebooks
  • Ad-hoc scripts
  • Full MLOps platforms that are too complex for the task

This raised a simple question:

What if LLM evaluation were as simple as running a function over a JSONL dataset?
What if the tests just worked?

The design principle for LLM-Expect:

Zero friction. Zero config. Zero ceremony.
Code plus dataset equals clarity.


What LLM-Expect Is and Is Not

LLM-Expect is

A lightweight Python SDK for evaluating LLM-powered functions with simple JSONL datasets. Everything runs locally. No servers or dashboards.

LLM-Expect is not

  • A cloud platform
  • An MLOps suite
  • A data generation tool
  • A YAML framework
  • A system that requires config files

There are no config files, no CLI, no web UI, and no pipelines.
Just Python.


Evaluation as the New Unit Test

As LLMs move deeper into enterprise workflows, a new reality is emerging:

Evaluation is becoming the new unit test.

Traditional testing asks:
Did this function return the expected output?

LLM evaluation asks deeper questions like:

  • Is the output acceptable?
  • Does it follow the schema?
  • Is it factually correct?
  • Is it semantically aligned with the expectation?

LLM-Expect brings that discipline into your development workflow.


Example

Instead of:

print(my_llm_function("hello"))

You can decorate your function:

from llm_expect import llm_expect

@llm_expect(dataset="tests.jsonl")
def my_llm_function(prompt: str):
    return {"response": "hello"}

Then run:

my_llm_function.run_eval()

How It Works

LLM-Expect relies on three pieces:

1. Your function

A Python function that calls an LLM or applies text logic.

2. A JSONL dataset

Each line is a test case:

{"input": {"text": "hello"}, "expected": {"output": "hello"}}

3. The LLM-Expect engine

It loads the dataset, runs the function, infers the correct metrics, and produces:

  • Case results
  • Clean summaries
  • Version-friendly JSON outputs
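
Conceptually, the engine is a small loop over the dataset. The sketch below is illustrative only, with hypothetical names, a simplified input mapping, and a plain equality check standing in for the real metric logic:

import json

def run_eval_sketch(fn, dataset_path):
    # Conceptual sketch of the evaluation loop, not LLM-Expect's real engine.
    results = []
    with open(dataset_path) as f:
        for line in f:
            case = json.loads(line)              # one test case per JSONL line
            output = fn(case["input"])           # run the function under test
            passed = output == case["expected"]  # stand-in for real metric inference
            results.append({"input": case["input"], "output": output, "passed": passed})
    passed_count = sum(r["passed"] for r in results)
    return {"cases": results, "summary": f"{passed_count}/{len(results)} passed"}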

Zero Config Metric Inference

LLM-Expect selects metrics based on the structure of the expected field in each test case.

Exact match

{"expected": {"output": "Paris"}}

Schema checks

{"expected": {"schema": {"name": "string", "age": "int"}}}

Judge-based checks

{"expected": {"judge": "semantic_match"}}

Why JSONL

JSONL is simple, diff-friendly, easy to version control, and easy to generate.
It is widely used across OpenAI evals, Anthropic evals, and research teams.
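
That simplicity shows up directly in code. Writing and reading a dataset takes a few lines of plain Python, with nothing LLM-Expect-specific involved:

import json

cases = [
    {"input": {"q": "Capital of France"}, "expected": {"output": "Paris"}},
    {"input": {"q": "2+2"}, "expected": {"output": "4"}},
]

# One JSON object per line: trivially diffable and easy to keep in git.
with open("tests.jsonl", "w") as f:
    for case in cases:
        f.write(json.dumps(case) + "\n")

# Reading it back is just as simple.
with open("tests.jsonl") as f:
    loaded = [json.loads(line) for line in f if line.strip()]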


Built-In Metrics

  • ExactMatchAccuracy
  • SchemaFidelity
  • Judge-based scoring

More metrics will follow, guided by real-world usefulness.
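
To make the first of these concrete, exact-match accuracy is conceptually just the fraction of cases whose output equals the expected value. The sketch below is an assumption about how such a metric could be computed, not the SDK's actual implementation:

def exact_match_accuracy(outputs: list[str], expected: list[str]) -> float:
    # Illustrative only; the normalization here is an assumption, not the SDK's code.
    matches = [
        out.strip().lower() == exp.strip().lower()
        for out, exp in zip(outputs, expected)
    ]
    return sum(matches) / len(matches) if matches else 0.0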


Minimal Example

Dataset (tests.jsonl)

{"input": {"q": "Capital of France"}, "expected": {"output": "Paris"}}
{"input": {"q": "2+2"}, "expected": {"output": "4"}}

Run

import os
from anthropic import Anthropic
from llm_expect import llm_expect

client = Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

def call_llm(prompt):
    message = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    # The response content is a list of blocks; the text lives in the first one.
    return message.content[0].text

@llm_expect(dataset="tests.jsonl")
def generate(prompt: str):
    return call_llm(prompt)

if __name__ == "__main__":
    generate.run_eval()

The Future of LLM-Expect

  • Additional semantic metrics
  • Optional CLI
  • More judge providers
  • Streamlined CI integration
  • Support for multi-step agent evaluation
  • A simple dataset builder (optional)

The philosophy stays the same:

Keep it local.
Keep it lightweight.
Keep it developer-first.


Explore More

DOCS
GitHub