How the benchmark is measured.
Plain language. Sample sizes. Caveats. If a number looks weird, the explanation is below — not buried.
The prompt corpus
We track 640 prompts spanning an "All" sample that mixes everything plus seven industry-specific samples: B2B SaaS, E-commerce, Marketing, AI tools, DevTools, Fintech, and Healthcare. Prompts come from real search demand — Google Search Console exports, Perplexity prompt logs, and the buyer questions surfaced by the Visibly audit pipeline — and are intent-typed as informational, commercial, or navigational.
The corpus is fixed week-to-week for stability; we expand it once per quarter and publish the diff so the dataset stays comparable across time.
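A quarterly diff of this kind can be sketched as a set comparison over prompt records. This is an illustration only: the `Prompt` fields, the `prompt_id` key, and the `corpus_diff` helper are hypothetical names and not the published format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prompt:
    prompt_id: str  # hypothetical stable identifier
    text: str
    sample: str     # e.g. "All", "B2B SaaS", "Fintech"
    intent: str     # "informational", "commercial", or "navigational"


def corpus_diff(previous, current):
    """Return the prompt ids added, removed, or changed between two corpus versions."""
    prev = {p.prompt_id: p for p in previous}
    curr = {p.prompt_id: p for p in current}
    return {
        "added": sorted(curr.keys() - prev.keys()),
        "removed": sorted(prev.keys() - curr.keys()),
        # Same id in both versions, but text, sample, or intent differs.
        "changed": sorted(k for k in prev.keys() & curr.keys() if prev[k] != curr[k]),
    }


# Made-up prompts, for illustration only.
q1 = [
    Prompt("p1", "best crm for startups", "B2B SaaS", "commercial"),
    Prompt("p2", "what is soc 2", "Fintech", "informational"),
]
q2 = [
    Prompt("p1", "best crm for startups", "B2B SaaS", "commercial"),
    Prompt("p3", "stripe login", "Fintech", "navigational"),
]
diff = corpus_diff(q1, q2)
```

Keying on a stable id rather than on prompt text means a reworded prompt shows up as "changed" instead of as one removal plus one addition.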
The five AI surfaces
We query the same prompt against five live-browsing AI surfaces: ChatGPT Search, Perplexity, Claude (with web), Gemini, and Google AI Overviews. Each returns a small set of citations; we capture the cited URLs verbatim, classify the content type, and aggregate.
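The aggregation step can be pictured as counting labeled citations per surface and converting the counts to shares. The sketch below assumes each citation has already been reduced to a `(surface, url, content_type)` tuple and that the output is a surface-by-content-type share matrix. The tuple layout and the `aggregate` function are assumptions, not the production pipeline.

```python
from collections import Counter

SURFACES = [
    "ChatGPT Search",
    "Perplexity",
    "Claude (with web)",
    "Gemini",
    "Google AI Overviews",
]


def aggregate(citations):
    """Turn (surface, url, content_type) tuples into {surface: {content_type: share}}.

    A surface name that is not in SURFACES raises KeyError.
    """
    counts = {surface: Counter() for surface in SURFACES}
    for surface, _url, content_type in citations:
        counts[surface][content_type] += 1

    shares = {}
    for surface, counter in counts.items():
        total = sum(counter.values())
        # A surface with no citations gets an empty row, which avoids dividing by zero.
        shares[surface] = {ct: n / total for ct, n in counter.items()} if total else {}
    return shares


# Made-up citations, for illustration only.
citations = [
    ("Perplexity", "https://example.com/a", "Guide"),
    ("Perplexity", "https://example.com/b", "Guide"),
    ("Perplexity", "https://example.com/c", "Listicle"),
    ("Gemini", "https://example.com/a", "Comparison"),
]
shares = aggregate(citations)
```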
The seven content types
Every cited URL is labeled with exactly one of seven content types by an LLM-judge classifier:
- Comparison: "X vs Y", "best X for Y", multi-vendor evaluations. Head-to-head structure.
- Listicle: Numbered or bulleted lists of items in a category ("12 best X", "top 7 Y").
- Guide: Step-by-step procedural content. HowTo schema or its equivalent.
- Explainer: Definition-led content. "What is X?", concept articles.
- Case study: A specific customer or scenario walked through with outcome metrics.
- Tool: An interactive calculator, comparison engine, or other utility hosted on the page.
- News: Time-sensitive product announcements, releases, or industry coverage.
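The "exactly one of seven" rule can be enforced after the judge answers, whatever model produces the answer. A minimal sketch, assuming the judge returns its label as plain text. `ContentType` and `parse_label` are hypothetical names, and no LLM call is shown.

```python
from enum import Enum


class ContentType(Enum):
    COMPARISON = "Comparison"
    LISTICLE = "Listicle"
    GUIDE = "Guide"
    EXPLAINER = "Explainer"
    CASE_STUDY = "Case study"
    TOOL = "Tool"
    NEWS = "News"


def parse_label(raw: str) -> ContentType:
    """Map a judge's raw text answer to exactly one ContentType, or raise ValueError."""
    cleaned = raw.strip().lower()
    for content_type in ContentType:
        if content_type.value.lower() == cleaned:
            return content_type
    # Multi-label answers such as "Guide, Comparison" are rejected rather than guessed at.
    raise ValueError(f"not a single known content type: {raw!r}")
```

For example, `parse_label(" guide ")` returns `ContentType.GUIDE`, while `parse_label("Guide, Comparison")` raises, so an ambiguous answer fails loudly instead of quietly picking a label.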
Refresh cadence
The full corpus is re-run every Monday at 09:00 UTC. Numbers on the benchmark page reflect last week's run. Once a month we publish a narrative interpretation of the four-week trend at /benchmark/reports.
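To work out which run a given page view reflects, one can compute the most recent Monday 09:00 UTC at or before the current time. The sketch below reads "last week's run" as the most recent scheduled run, which is an assumption. `latest_run` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def latest_run(now: datetime) -> datetime:
    """Most recent Monday 09:00 UTC at or before `now`.

    Pass a timezone-aware datetime; astimezone() treats a naive one as local time.
    """
    now = now.astimezone(timezone.utc)
    # weekday() is 0 on Monday, so this steps back to Monday of the current week.
    monday = now - timedelta(days=now.weekday())
    candidate = monday.replace(hour=9, minute=0, second=0, microsecond=0)
    if candidate > now:
        # It is Monday before 09:00 UTC, so the latest run is the previous week's.
        candidate -= timedelta(days=7)
    return candidate


tuesday_noon = datetime(2026, 5, 26, 12, 0, tzinfo=timezone.utc)
run = latest_run(tuesday_noon)  # Monday 2026-05-25 09:00 UTC
```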
Limits of the dataset
Three caveats we want to be loud about:
- Sample sizes vary per category. "All" sees the full 640-prompt run; the seven industry-specific sub-categories each see roughly 80 prompts. Smaller samples are noisier; we flag any cell with fewer than 20 citations in the source data.
- LLM-judge classification is imperfect. Edge cases (e.g. a guide-formatted comparison) get a single label. We accept ~5% classification noise; the trends still hold.
- The corpus reflects buyer-intent queries, not the open web. If you publish in a vertical we don't sample, the numbers may not generalize to your content. Request a category we don't cover by emailing the team.
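The first caveat can be made concrete with a small sketch: flag any cell under the 20-citation threshold, and use the textbook binomial standard error to see why small cells are noisier. The dict layout and both function names are assumptions. Only the threshold of 20 comes from the text above.

```python
import math

MIN_CITATIONS = 20  # threshold stated in the caveat above


def flag_small_cells(counts):
    """counts: {(sample, content_type): citation_count}. Return the cells below the threshold."""
    return sorted(cell for cell, n in counts.items() if n < MIN_CITATIONS)


def share_standard_error(share: float, n: int) -> float:
    """Binomial standard error of a share estimated from n citations."""
    return math.sqrt(share * (1 - share) / n)


# Made-up counts, for illustration only.
counts = {
    ("All", "Guide"): 412,
    ("Fintech", "Tool"): 7,
    ("Healthcare", "News"): 19,
    ("DevTools", "Guide"): 20,
}
flagged = flag_small_cells(counts)
```

Under this model, a 50% share measured on 100 citations has a standard error of about 5 percentage points; on 25 citations it is about 10.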
What we publish — and what we don't
We publish: the matrix, the deltas, and our interpretation of the deltas. We don't publish: the raw cited URLs (publisher privacy), prompt-level results (gives away the corpus), or any per-domain ranking.
Reproducibility
The data file backing the page lives at src/data/benchmark.json in the marketing site repo and is regenerated on every Monday's Prompt Monitor run. The shape is stable; this methodology page is versioned as the dataset evolves.
Brand marks
OpenAI, Perplexity, Anthropic (Claude), and Google (Gemini, AI Overviews) brand marks appearing on this page and on the Index are used for editorial benchmark comparison. All trademarks belong to their respective owners. The Visibly Index is independent of, and unaffiliated with, any of these companies.
Last updated · 2026-05-26 · Questions: hello@visibly.so