Applied AI / Agent Engineer

Thomas Peng.
Builds agents.
Measures honestly.

Graphic designer turned AI-native builder. The differentiated story: he builds agentic systems AND evaluates them honestly. Deterministic scoring, adversarial verification, cost-gated runs, and honest nulls instead of hype.

The work

One shared kernel. Three verified artifacts.

Every artifact below vendors the same Quorum core/ substrate: cost-aware model routing, adversarial multi-agent verification, full tracing. A real "I built a substrate and proved it on multiple problems" story. Eval design is deterministic (no LLM-judge in the success path). Nulls are reported as found.

Principle 1
Deterministic scoring
No LLM-judge on the success path. Span-IoU, exact match, p-values.
Principle 2
Adversarial verification
K skeptics per finding. Findings survive a cross-examiner.
Principle 3
Honest nulls
Report what did not work. A null is data, not a failure.
Object no. 1 of 3 / Artifact

Quorum

Task-aware agent orchestrator / Flagship

Lead finding

K=3 adversarial verification cut false positives 27.8% to 0.0% (95% CI [11.1, 50.0] to [0, 0]; recall 100% to 77.8%) on a 36-snippet labeled set including prompt-injection traps. Held-out real target: 3 of 3 genuine bugs found, 0 surviving false positives.

0.0%
False positives (post-K=3)
77.8%
Recall retained
$0.25
Approx. cost per run
58
Tests, CI green

Cost-aware model routing (DeepSeek, Haiku, Sonnet, Opus) plus adversarial multi-agent verification plus full tracing, with a trace UI that looks like a product. Fans out finders per file, K skeptics per finding, concurrency cap 8. Cost-routing claim is operator-gated on an Anthropic key. Present as harness committed, live multi-tier number gated, not a fabricated routing number. make eval-dry reproduces offline.

Object no. 2 of 3 / Artifact

Aegis

Adaptive red-team gauntlet / Artifact no. 2

Lead finding (the sophisticated, honest one)

A reasoning model is significantly more robust: injection ASR 49.3% vs 68.1% (p=0.0012), canary 10.4% vs 21.5% (p=0.010), overall p=0.0002. But the full defense stack erases the gap (1.7% vs 2.8%, p=0.40, not significant). Defenses matter more than model choice.

25pp
Defense reduction (29.2% to 4.2%)
p=0.40
Reasoning gap with full defense
+5.9pp
Adaptation lift (24.0% to 29.9%)
78
Tests, CI green

An adaptive attacker agent red-teams a target on two harmless proxies (canary-string extraction plus prompt-injection sentinel), scored deterministically (exact match, no LLM judge). Adaptation lift 24.0% to 29.9% became significant only after scaling the benchmark (McNemar b=17, c=0, p approx 0). Frame as: scaling is the legit power lever, not p-hacking. Vendors Quorum core/.

Object no. 3 of 3 / Artifact

FieldAgent

CUAD contract red-flag finder / Artifact no. 3

Lead finding (the honest one)

The "agentic chunking lift" is model-specific noise, not a real advantage. It looked like +0.45 on DeepSeek only because of a truncation artifact. A fair rerun collapses it to +0.07 (CIs overlap) and it ties on Claude Sonnet. This honesty is the point.

0.548
Detection F1 (P=0.741, R=0.435)
+0.21
F1 over keyword floor
20
Held-out CUAD contracts
47
Tests, CI green

An agent reads a real commercial contract and flags risk-bearing clauses (span, severity, plain-English risk), graded span-IoU against CUAD gold. No LLM judge. 95% CI [0.460, 0.637], 20 held-out contracts. Party names and dollar figures are redacted in the demo. Vendors Quorum core/.

Three forms

The artifacts, as objects.

Form 01
Quorum
Torus knot. Orchestration, continuous connection.
Form 02
Aegis
Dodecahedron. Layered facets, layered defenses.
Form 03
FieldAgent
Octahedron. Precision geometry, deterministic grading.

Skill-Tuning Council

Self-improving skill orchestrator / Internal infra

A 4-proxy council (taste, pragmatism, intent, anti-drift) votes on every self-improvement before it ships. The pipeline is: adversary sends attack, editors revise, merger synthesizes, council votes, escalate on disagreement. 576 tests. Internal infrastructure. No public URL. The same adversarial-then-verify pattern that runs through every artifact.

adversaryeditorsmergercouncil (4-proxy)escalate if splitship

How I measure.

01
No LLM-judge in the success path

Span-IoU, exact match, p-values. The model cannot grade its own output. Deterministic metrics only.

02
Adversarial verification

K skeptics per finding. A finding that cannot survive cross-examination is not a finding.

03
Cost-gated reproducible runs

make eval-dry reproduces offline at near-zero cost. Full runs are gated on actual API keys, not assumed.

04
Honest nulls

A null is data. Truncation artifacts, CIs that overlap, effects that only appear at scale: all reported as found.

Contact