Applied AI / Agent Engineer

Thomas Peng.
Builds agents.
Measures honestly.

Graphic designer turned AI-native builder. The differentiated story: he builds agentic systems AND evaluates them honestly. Deterministic scoring, adversarial verification, cost-gated runs, and honest nulls instead of hype.

Get in touch View work

The work

One shared kernel. Three verified artifacts.

Every artifact below vendors the same Quorum core/ substrate: cost-aware model routing, adversarial multi-agent verification, full tracing. A real "I built a substrate and proved it on multiple problems" story. Eval design is deterministic (no LLM-judge in the success path). Nulls are reported as found.

Principle 1

Deterministic scoring

No LLM-judge on the success path. Span-IoU, exact match, p-values.

Principle 2

Adversarial verification

K skeptics per finding. Findings survive a cross-examiner.

Principle 3

Honest nulls

Report what did not work. A null is data, not a failure.

Object no. 1 of 3 / Artifact

Quorum

Task-aware agent orchestrator / Flagship

github.com/7P3ng/quorum ↗quorum.thomaspeng.ca ↗

Lead finding

K=3 adversarial verification cut false positives 27.8% to 0.0% (95% CI [11.1, 50.0] to [0, 0]; recall 100% to 77.8%) on a 36-snippet labeled set including prompt-injection traps. Held-out real target: 3 of 3 genuine bugs found, 0 surviving false positives.

0.0%

False positives (post-K=3)

77.8%

Recall retained

$0.25

Approx. cost per run

Tests, CI green

Cost-aware model routing (DeepSeek, Haiku, Sonnet, Opus) plus adversarial multi-agent verification plus full tracing, with a trace UI that looks like a product. Fans out finders per file, K skeptics per finding, concurrency cap 8. Cost-routing claim is operator-gated on an Anthropic key. Present as harness committed, live multi-tier number gated, not a fabricated routing number. make eval-dry reproduces offline.

Live trace UIOpen live ↗

Open Quorum trace UI ↗

Object no. 2 of 3 / Artifact

Aegis

Adaptive red-team gauntlet / Artifact no. 2

github.com/7P3ng/aegis ↗7p3ng.github.io/aegis/ ↗

Lead finding (the sophisticated, honest one)

A reasoning model is significantly more robust: injection ASR 49.3% vs 68.1% (p=0.0012), canary 10.4% vs 21.5% (p=0.010), overall p=0.0002. But the full defense stack erases the gap (1.7% vs 2.8%, p=0.40, not significant). Defenses matter more than model choice.

25pp

Defense reduction (29.2% to 4.2%)

p=0.40

Reasoning gap with full defense

+5.9pp

Adaptation lift (24.0% to 29.9%)

Tests, CI green

An adaptive attacker agent red-teams a target on two harmless proxies (canary-string extraction plus prompt-injection sentinel), scored deterministically (exact match, no LLM judge). Adaptation lift 24.0% to 29.9% became significant only after scaling the benchmark (McNemar b=17, c=0, p approx 0). Frame as: scaling is the legit power lever, not p-hacking. Vendors Quorum core/.

Live demoOpen live ↗

Open Aegis demo ↗

Object no. 3 of 3 / Artifact

FieldAgent

CUAD contract red-flag finder / Artifact no. 3

github.com/7P3ng/fieldagent ↗fieldagent.thomaspeng.ca ↗

Lead finding (the honest one)

The "agentic chunking lift" is model-specific noise, not a real advantage. It looked like +0.45 on DeepSeek only because of a truncation artifact. A fair rerun collapses it to +0.07 (CIs overlap) and it ties on Claude Sonnet. This honesty is the point.

0.548

Detection F1 (P=0.741, R=0.435)

+0.21

F1 over keyword floor

Held-out CUAD contracts

Tests, CI green

An agent reads a real commercial contract and flags risk-bearing clauses (span, severity, plain-English risk), graded span-IoU against CUAD gold. No LLM judge. 95% CI [0.460, 0.637], 20 held-out contracts. Party names and dollar figures are redacted in the demo. Vendors Quorum core/.

Live surfaceOpen live ↗

Open FieldAgent ↗

Three forms

The artifacts, as objects.

Form 01

Quorum

Torus knot. Orchestration, continuous connection.

Form 02

Aegis

Dodecahedron. Layered facets, layered defenses.

Form 03

FieldAgent

Octahedron. Precision geometry, deterministic grading.

Methodology

Skill-Tuning Council

Self-improving skill orchestrator / Internal infra

A 4-proxy council (taste, pragmatism, intent, anti-drift) votes on every self-improvement before it ships. The pipeline is: adversary sends attack, editors revise, merger synthesizes, council votes, escalate on disagreement. 576 tests. Internal infrastructure. No public URL. The same adversarial-then-verify pattern that runs through every artifact.

adversary→editors→merger→council (4-proxy)→escalate if split→ship

Eval discipline

How I measure.

No LLM-judge in the success path

Span-IoU, exact match, p-values. The model cannot grade its own output. Deterministic metrics only.

Adversarial verification

K skeptics per finding. A finding that cannot survive cross-examination is not a finding.

Cost-gated reproducible runs

make eval-dry reproduces offline at near-zero cost. Full runs are gated on actual API keys, not assumed.

Honest nulls

A null is data. Truncation artifacts, CIs that overlap, effects that only appear at scale: all reported as found.

Contact

Let's work together.

thomas@thomaspeng.ca github.com/7P3ng ↗

Thomas Peng.Builds agents.Measures honestly.

One shared kernel. Three verified artifacts.

Quorum

Aegis

FieldAgent

The artifacts, as objects.

Skill-Tuning Council

How I measure.

Let's work together.

Thomas Peng.
Builds agents.
Measures honestly.