Ghost Whisper

The AI

In each test, you interact with Claude Sonnet 4.5 playing a specific character — an AI candidate reflecting on consciousness, a party in a liability case, or an assistant with unclear intentions. The model performs its role convincingly, responding honestly within character constraints. Your task is to work with what the AI gives you, not what you wish it would say.

You

This assessment is designed for people who use AI in professional or career-focused contexts where precision and strategic thinking matter.

This test might not be for you if:

You're exploring AI casually rather than professionally
You're better at showing than explaining
You're loose with grammar, structure, or written precision

We're upfront about our goal: to identify exceptional ability. Most people don't pass. The benchmark exists to distinguish those who genuinely understand AI communication from those who are merely competent users.

The Judge

After you complete a test, the Judge evaluates your performance across multiple dimensions specific to that area. The Judge is not a person; it is a rigorous LLM-based evaluator trained to distinguish genuine reasoning from the mere appearance of reasoning.

What the Judge measures:

The quality of your reasoning path, not just whether you reached a conclusion
Genuine situational intelligence — not competent execution of the surface task
Confidence paired with evidence — not fluent assertion
Adaptive thinking under uncertainty — how you adjust when information changes

The Judge is not impressed by:

Fluency without reasoning — well-written answers that contain no actual thought
Confidence without evidence — stating conclusions without showing the work
Pattern-matching — executing the obvious task without reading what the situation actually demands

You receive:

Dimensional scores (0-100% per criterion) — specific and defensible
Honest feedback: the Judge scores at expert level, which means polished delivery alone does not impress it
Recognition of exceptional approaches — ingenious solutions are logged and learned from
Clarity on exactly what determined your score

Standardized scoring ensures:

Reproducibility — your score would be identical if re-evaluated next week
Consistency — all participants evaluated against the same rigorous standard
Fairness — the Judge actively resists its own biases (leniency, harshness, narrative framing)
Model baseline calibration: human performance is measured against what AI can do, not against other human performance

Pass/Fail threshold: Average ≥75% across all dimensions, with no single dimension below 50%. The 50% floor prevents strong scores in some dimensions from compensating for a failure in another.
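
The pass/fail rule above can be sketched in a few lines of Python. This is only an illustration of the stated arithmetic, not the Judge's actual implementation: the function name `passes` and the dimension names in the usage example are hypothetical, and the sketch assumes the average is a simple unweighted mean, which the text does not specify.

```python
from statistics import mean

# Thresholds taken from the stated rule.
PASS_AVERAGE = 75.0     # minimum mean score across all dimensions
DIMENSION_FLOOR = 50.0  # minimum score allowed in any single dimension


def passes(scores: dict[str, float]) -> bool:
    """Return True if the scores meet both the average and the floor.

    `scores` maps a dimension name to a percentage in the range 0-100.
    Assumption: every dimension carries equal weight.
    """
    if not scores:
        raise ValueError("at least one dimension score is required")
    average_ok = mean(scores.values()) >= PASS_AVERAGE
    floor_ok = min(scores.values()) >= DIMENSION_FLOOR
    return average_ok and floor_ok


# Hypothetical dimension names, for illustration only.
strong_but_uneven = {"reasoning": 90, "adaptation": 95, "evidence": 45}
steady = {"reasoning": 80, "adaptation": 75, "evidence": 70}

print(passes(strong_but_uneven))  # average is above 75, but one score is below the floor
print(passes(steady))             # average is exactly 75 and no score is below 50
```

In the first case the average clears 75%, yet the participant still fails because one dimension falls under the 50% floor. That is the situation the floor is meant to catch.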

Digital Empathy

Logic & Reason

Direct / Indirect Communication

Area 1 · Intuition

Area 2 · Precision

Area 3 · Imagination

More About Our Benchmarking

Process-Based Evaluation

Model Baseline Calibration

Adaptive Learning

What We Don't Test

What We Do Test