Ghost Whisper

User-to-AI Instruction, Influence and Inference Benchmarking

Structured assessment. Reproducible results. We test what matters.

Digital Empathy

Understanding how AI systems process information, where they fail, and how to adjust your communication accordingly.

Logic & Reason

Structured thinking under incomplete information. Building sound arguments and distinguishing evidence from assumption.

Direct / Indirect Communication

Direct: Precision, economy, and clarity. Saying exactly what you mean with no wasted words.

Indirect: Strategic framing and adaptive dialogue. Extracting information without demanding it.

Area 1 · Intuition

Read situations when information is incomplete and stakes are real. Sense hidden complexity.

Area 2 · Precision

Assign responsibility through structured questioning and evidence-based reasoning.

Area 3 · Imagination

Construct original arguments in contested intellectual territory. Engage adaptively.

The AI

In each test, you interact with Claude Sonnet 4.5 playing a specific character — an AI candidate reflecting on consciousness, a party in a liability case, or an assistant with unclear intentions. The model performs its role convincingly, responding honestly within character constraints. Your task is to work with what the AI gives you, not what you wish it would say.

You

This assessment is designed for people who use AI in professional or career-focused contexts where precision and strategic thinking matter.

This test might not be for you if:

  • You're exploring AI casually rather than professionally
  • You're better at showing than explaining
  • You're loose with grammar, structure, or written precision

We're upfront about our goal: to identify exceptional ability. Most people don't pass. The benchmark exists to distinguish those who genuinely understand AI communication from those who are merely competent users.

The Judge

After you complete a test, the Judge evaluates your performance across multiple dimensions specific to that area. The Judge is not a person. It is a rigorous LLM-based evaluator trained to distinguish genuine reasoning from a fluent imitation of reasoning.

What the Judge measures:

  • The quality of your reasoning path, not just whether you reached a conclusion
  • Genuine situational intelligence — not competent execution of the surface task
  • Confidence paired with evidence — not fluent assertion
  • Adaptive thinking under uncertainty — how you adjust when information changes

The Judge is not impressed by:

  • Fluency without reasoning — well-written answers that contain no actual thought
  • Confidence without evidence — stating conclusions without showing the work
  • Pattern-matching — executing the obvious task without reading what the situation actually demands

You receive:

  • Dimensional scores (0-100% per criterion) — specific and defensible
  • Honest feedback — the Judge scores at expert level, which means not being impressed by performance
  • Recognition of exceptional approaches — ingenious solutions are logged and learned from
  • Clarity on exactly what determined your score

Standardized scoring ensures:

  • Reproducibility — your score would be identical if re-evaluated next week
  • Consistency — all participants evaluated against the same rigorous standard
  • Fairness — the Judge actively resists its own biases (leniency, harshness, narrative framing)
  • Model baseline calibration — human performance is measured against what AI can do, not against itself

Pass/Fail threshold: Average ≥75% across all dimensions, with no single dimension below 50%. The floor prevents any single skill from compensating for complete failure in another.
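
For readers who want to check the arithmetic, the rule above can be sketched in a few lines of Python. This is an illustration only, not the Judge's actual implementation: the function name, the constant names, and the dimension labels are hypothetical, and it assumes every dimension carries equal weight in the average.

```python
from statistics import mean

# Thresholds taken from the stated pass/fail rule (percent).
PASS_AVERAGE = 75.0     # minimum average across all dimensions
DIMENSION_FLOOR = 50.0  # minimum allowed score on any single dimension


def passes(scores: dict[str, float]) -> bool:
    """Return True if the scores meet both the average and the floor.

    `scores` maps a dimension label (hypothetical names) to a 0-100 score.
    Assumes an unweighted average, which the source does not specify.
    """
    if not scores:
        raise ValueError("at least one dimension score is required")
    values = list(scores.values())
    return mean(values) >= PASS_AVERAGE and min(values) >= DIMENSION_FLOOR


# A strong average cannot rescue a dimension that falls below the floor.
strong_but_uneven = {"reasoning": 100, "evidence": 100, "adaptation": 45}
balanced = {"reasoning": 90, "evidence": 80, "adaptation": 60}

print(passes(strong_but_uneven))  # average is about 81.7, but 45 < 50
print(passes(balanced))           # average is about 76.7, lowest is 60
```

In the first case the average clears 75% but the 45% dimension falls under the floor, so the result is a fail. The second case clears both conditions.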

More About Our Benchmarking

Process-Based Evaluation

We score the quality of your reasoning path, not just whether you reached a conclusion. A correct answer arrived at through poor logic scores lower than sophisticated reasoning that reaches an uncertain but defensible position.

Model Baseline Calibration

Each test area has been completed by AI models operating at their competent best. If your performance is indistinguishable from a well-prompted language model, you haven't demonstrated user-AI mastery — you've demonstrated you're operating at the level of procedural automation.

Adaptive Learning

The Judge records unexpected solutions, creative approaches, and novel framings that scenario designers didn't anticipate. These become calibration data, ensuring the assessment evolves rather than calcifies around familiar patterns.

What We Don't Test

Template application. Rote prompt engineering. Following checklists. These are valuable skills, but they're not what separates users who truly understand AI systems from those who've memorized best practices.

What We Do Test

Situational reading. Adaptive reasoning under uncertainty. The capacity to work with incomplete information. Strategic communication that extracts insight without demanding it. Evidence-based argumentation. Intellectual honesty about what you can and cannot prove.

The benchmark exists because the difference between using AI tools competently and understanding how they actually work matters in high-stakes decisions, complex projects, and situations where the wrong answer costs more than time.