LLM Expect

View on GitHub

A Lightweight, Developer-First Way to Test LLM-Powered Functions

Over the past year, I’ve watched the LLM evaluation ecosystem get more complicated, not less. Teams are now building agent workflows, reasoning chains, translation systems, summarizers, RAG pipelines, and production LLM features. Yet one foundational problem remains surprisingly unsolved:

  • How do we reliably test LLM-powered functions?
  • How do we verify expected behavior before shipping?
  • How do we measure quality without rebuilding evaluation tooling every time?

In traditional software, unit tests are a given.
With LLMs, testing is still early and fragmented. Most tools today are heavy, configuration-driven, dashboard-centric, or tied to proprietary infrastructure.

I built LLM-Expect as a tiny, minimalist, developer-first Python SDK that makes LLM evaluation feel as natural as writing a function.

The belief behind it is simple:

Evaluation should live in your codebase, not in a separate platform.


Why I Built LLM-Expect

Across conversational AI, applied LLM systems, innovation platforms, and agent workflows, I kept seeing the same pattern:

  • Teams writing bespoke evaluation harnesses
  • One-off notebooks
  • JSON scripts hacked together for each project
  • Common metrics re-implemented over and over
  • Heavy MLOps setups used for simple function checks

Current LLM QA often looks like:

  • Manual checking
  • Notebooks
  • Ad-hoc scripts
  • Full MLOps platforms that are too complex for the task

This raised a simple question:

What if LLM evaluation were as simple as running a function over a JSONL dataset?
What if the tests just worked?

The design principle for LLM-Expect:

Zero friction. Zero config. Zero ceremony.
Code plus dataset equals clarity.


What LLM-Expect Is and Is Not

LLM-Expect is

A lightweight Python SDK for evaluating LLM-powered functions with simple JSONL datasets. Everything runs locally. No servers or dashboards.

LLM-Expect is not

  • A cloud platform
  • An MLOps suite
  • A data generation tool
  • A YAML framework
  • A system that requires config files

There are no config files, no CLI, no web UI, and no pipelines.
Just Python.


Evaluation as the New Unit Test

As LLMs move deeper into enterprise workflows, a new reality is emerging:

Evaluation is becoming the new unit test.

Traditional testing asks:
Did this function return the expected output?

LLM evaluation asks deeper questions like:

  • Is the output acceptable?
  • Does it follow the schema?
  • Is it factually correct?
  • Is it semantically aligned with the expectation?

LLM-Expect brings that discipline into your development workflow.


Example

Instead of:

print(my_llm_function("hello"))

You can decorate your function:

from llm_expect import llm_expect

@llm_expect(dataset="tests.jsonl")
def my_llm_function(prompt: str):
    return {"response": "hello"}

Then run:

my_llm_function.run_eval()

How It Works

LLM-Expect relies on three pieces:

1. Your function

A Python function that calls an LLM or applies text logic.

2. A JSONL dataset

Each line is a test case:

{"input": {"text": "hello"}, "expected": {"output": "hello"}}

3. The LLM-Expect engine

It loads the dataset, runs the function, infers the correct metrics, and produces:

  • Case results
  • Clean summaries
  • Version-friendly JSON outputs
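
Conceptually, the engine is a small loop over the dataset. The sketch below is illustrative only, with hypothetical names, a simplified input mapping, and a plain equality check standing in for the real metric logic:

import json

def run_eval_sketch(fn, dataset_path):
    # Conceptual sketch of the evaluation loop, not LLM-Expect's real engine.
    results = []
    with open(dataset_path) as f:
        for line in f:
            case = json.loads(line)              # one test case per JSONL line
            output = fn(case["input"])           # run the function under test
            passed = output == case["expected"]  # stand-in for real metric inference
            results.append({"input": case["input"], "output": output, "passed": passed})
    passed_count = sum(r["passed"] for r in results)
    return {"cases": results, "summary": f"{passed_count}/{len(results)} passed"}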

Zero Config Metric Inference

LLM-Expect selects metrics based on the structure of the expected field in each test case.

Exact match

{"expected": {"output": "Paris"}}

Schema checks

{"expected": {"schema": {"name": "string", "age": "int"}}}

Judge-based checks

{"expected": {"judge": "semantic_match"}}

Why JSONL

JSONL is simple, diff-friendly, easy to version control, and easy to generate.
It is widely used across OpenAI evals, Anthropic evals, and research teams.
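
That simplicity shows up directly in code. Writing and reading a dataset takes a few lines of plain Python, with nothing LLM-Expect-specific involved:

import json

cases = [
    {"input": {"q": "Capital of France"}, "expected": {"output": "Paris"}},
    {"input": {"q": "2+2"}, "expected": {"output": "4"}},
]

# One JSON object per line: trivially diffable and easy to keep in git.
with open("tests.jsonl", "w") as f:
    for case in cases:
        f.write(json.dumps(case) + "\n")

# Reading it back is just as simple.
with open("tests.jsonl") as f:
    loaded = [json.loads(line) for line in f if line.strip()]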


Built-In Metrics

  • ExactMatchAccuracy
  • SchemaFidelity
  • Judge-based scoring

More metrics will follow, guided by real-world usefulness.
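
To make the first of these concrete, exact-match accuracy is conceptually just the fraction of cases whose output equals the expected value. The sketch below is an assumption about how such a metric could be computed, not the SDK's actual implementation:

def exact_match_accuracy(outputs: list[str], expected: list[str]) -> float:
    # Illustrative only; the normalization here is an assumption, not the SDK's code.
    matches = [
        out.strip().lower() == exp.strip().lower()
        for out, exp in zip(outputs, expected)
    ]
    return sum(matches) / len(matches) if matches else 0.0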


Minimal Example

Dataset (tests.jsonl)

{"input": {"q": "Capital of France"}, "expected": {"output": "Paris"}}
{"input": {"q": "2+2"}, "expected": {"output": "4"}}

Run

import os
from anthropic import Anthropic
from llm_expect import llm_expect

client = Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

def call_llm(prompt):
    message = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    # The response content is a list of blocks; the text lives in the first one.
    return message.content[0].text

@llm_expect(dataset="tests.jsonl")
def generate(prompt: str):
    return call_llm(prompt)

if __name__ == "__main__":
    generate.run_eval()

The Future of LLM-Expect

  • Additional semantic metrics
  • Optional CLI
  • More judge providers
  • Streamlined CI integration
  • Support for multi-step agent evaluation
  • A simple dataset builder (optional)

The philosophy stays the same:

Keep it local.
Keep it lightweight.
Keep it developer-first.


Explore More

DOCS
GitHub