LLM-Expect
A Lightweight, Developer-First Way to Test LLM-Powered Functions
Over the past year, I’ve watched the LLM evaluation ecosystem get more complicated, not less. Teams are now building agent workflows, reasoning chains, translation systems, summarizers, RAG pipelines, and production LLM features. Yet one foundational problem remains surprisingly unsolved:
- How do we reliably test LLM-powered functions?
- How do we verify expected behavior before shipping?
- How do we measure quality without rebuilding evaluation tooling every time?
In traditional software, unit tests are a given.
With LLMs, testing is still early and fragmented. Most tools today are heavy, configuration-driven, dashboard-centric, or tied to proprietary infrastructure.
I built LLM-Expect as a tiny, developer-first Python SDK that makes LLM evaluation feel as natural as writing a function.
The belief behind it is simple:
Evaluation should live in your codebase, not in a separate platform.
Why I Built LLM-Expect
Across conversational AI, applied LLM systems, innovation platforms, and agent workflows, I kept seeing the same pattern:
- Teams writing bespoke evaluation harnesses
- One-off notebooks
- JSON scripts hacked together for each project
- Common metrics re-implemented over and over
- Heavy MLOps setups used for simple function checks
In practice, LLM QA today is a patchwork of manual checks, notebooks, ad-hoc scripts, and full MLOps platforms that are far heavier than the task requires.
This kept coming back to the same questions:
What if LLM evaluation were as simple as running a function over a JSONL dataset?
What if the tests just worked?
The design principle for LLM-Expect:
Zero friction. Zero config. Zero ceremony.
Code plus dataset equals clarity.
What LLM-Expect Is and Is Not
LLM-Expect is
A lightweight Python SDK for evaluating LLM-powered functions with simple JSONL datasets. Everything runs locally. No servers or dashboards.
LLM-Expect is not
- A cloud platform
- An MLOps suite
- A data generation tool
- A YAML framework
- A system that requires config files
There are no config files, no CLI, no web UI, and no pipelines.
Just Python.
Evaluation as the New Unit Test
As LLMs move deeper into enterprise workflows, a new reality is emerging:
Evaluation is becoming the new unit test.
Traditional testing asks:
Did this function return the expected output?
LLM evaluation asks deeper questions like:
- Is the output acceptable?
- Does it follow the schema?
- Is it factually correct?
- Is it semantically aligned with the expectation?
LLM-Expect brings that discipline into your development workflow.
Example
Instead of:
print(my_llm_function("hello"))
You can decorate your function:
from llm_expect import llm_expect

@llm_expect(dataset="tests.jsonl")
def my_llm_function(prompt: str):
    # A real function would call an LLM here; this stub returns a fixed response.
    return {"response": "hello"}
Then run:
my_llm_function.run_eval()
How It Works
LLM-Expect relies on three pieces:
1. Your function
A Python function that calls an LLM or applies text logic.
2. A JSONL dataset
Each line is a test case:
{"input": {"text": "hello"}, "expected": {"output": "hello"}}
3. The LLM-Expect engine
It loads the dataset, runs the function, infers the correct metrics, and produces:
- Case results
- Clean summaries
- Version-friendly JSON outputs
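Conceptually, that loop is small. The sketch below is illustrative only: the function name and the way the input dict is mapped to arguments are assumptions, not LLM-Expect internals, but it captures the idea of reading each JSONL case, calling the function, and comparing against the expectation.

import json

# Illustrative sketch of the evaluation loop, not the SDK's real implementation.
def run_eval_sketch(fn, dataset_path: str) -> list[dict]:
    results = []
    with open(dataset_path, encoding="utf-8") as f:
        for line in f:
            case = json.loads(line)
            # How the input dict maps onto the function's parameters is up to
            # the SDK; here we simply unpack it as keyword arguments.
            output = fn(**case["input"])
            # The real engine infers richer metrics from the shape of "expected";
            # a plain equality check stands in for that here.
            passed = output == case["expected"]
            results.append({"input": case["input"], "output": output, "passed": passed})
    return results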
Zero Config Metric Inference
LLM-Expect selects metrics based on the structure of expected.
Exact match
{"expected": {"output": "Paris"}}
Schema checks
{"expected": {"schema": {"name": "string", "age": "int"}}}
Judge-based checks
{"expected": {"judge": "semantic_match"}}
Why JSONL
JSONL is simple, diff-friendly, easy to version-control, and easy to generate.
It is widely used across OpenAI evals, Anthropic evals, and research teams.
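Because each case is a single JSON object on its own line, a dataset can be written (or generated) with nothing beyond the standard library:

import json

# Build tests.jsonl: one JSON object per line, nothing else.
cases = [
    {"input": {"q": "Capital of France"}, "expected": {"output": "Paris"}},
    {"input": {"q": "2+2"}, "expected": {"output": "4"}},
]
with open("tests.jsonl", "w", encoding="utf-8") as f:
    for case in cases:
        f.write(json.dumps(case) + "\n")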
Built-In Metrics
- ExactMatchAccuracy
- SchemaFidelity
- Judge-based scoring
More metrics will follow, guided by real-world usefulness.
Minimal Example
Dataset (tests.jsonl)
{"input": {"q": "Capital of France"}, "expected": {"output": "Paris"}}
{"input": {"q": "2+2"}, "expected": {"output": "4"}}
Run
import os
from anthropic import Anthropic
from llm_expect import llm_expect
client = Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))
def call_llm(prompt):
    message = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    # The response text lives in the first content block.
    return message.content[0].text

@llm_expect(dataset="tests.jsonl")
def generate(prompt: str):
    return call_llm(prompt)

if __name__ == "__main__":
    generate.run_eval()
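If you want the evaluation to run alongside ordinary unit tests, one option is to call run_eval from a pytest test. This is a sketch under assumptions: the example above is saved as eval_example.py (a name chosen here for illustration), ANTHROPIC_API_KEY is set, and run_eval() is treated as a smoke test since its return value is not shown above.

# test_eval_example.py - hedged sketch: run the JSONL evaluation under pytest.
from eval_example import generate  # hypothetical module name for the script above

def test_generate_against_dataset():
    # Runs every case in tests.jsonl; inspect the summary / JSON output for per-case results.
    generate.run_eval()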
The Future of LLM-Expect
- Additional semantic metrics
- Optional CLI
- More judge providers
- Streamlined CI integration
- Support for multi-step agent evaluation
- A simple dataset builder (optional)
The philosophy stays the same:
Keep it local.
Keep it lightweight.
Keep it developer-first.