Agent: Do You Understand the Words Coming Out of My Mouth?

Photo by Jason Rosewell on Unsplash

TL;DR

Your website is probably invisible to AI answer engines like Perplexity. By adding a handful of static files (llms.txt, per-post markdown files, JSON feeds) and some HTML tags (Schema.org JSON-LD, hreflang links, sitemap discovery), you can make your content easily discoverable, parseable, and citable by AI agents. None of it requires a framework or third-party service: just templates that run once and cover every future post.

A few weeks ago, as I started to write more about AI, I figured that what I was posting to the web should be easily accessible by the technology I was writing about. I had Claude Code review my website structure and give me suggestions on how to make it more AI agent-friendly.

I also added automated translation into three other languages to my publishing process. Beyond making the content accessible to more people, it has the side benefit of surfacing in AI agent searches in those languages.

This morning I came across Julia Solorzano's post on answer engine optimization, which gave me some additional ideas. Her post approaches it from a slightly different angle, and that made me realize this information might be useful to others.

So here's what I've learned: AI agents and answer engines (think Perplexity) are increasingly how people find information. If your site isn't set up for them, you might be missing out on additional visitors. Here's what you need to do, in order of impact. If you use an AI coding agent, there's a ready-to-paste prompt at the end to implement all of this for you.

1. Add llms.txt and llms-full.txt

Serve two plain-text files from your domain root following the llms.txt standard.

  • /llms.txt: Site description followed by a list of every page with title, URL, description, and keywords. Lightweight enough for any agent to fetch in a single request.
  • /llms-full.txt: Same structure, but includes full page content inline. Set a size cap (100KB is reasonable) and fall back to title + description for remaining pages when exceeded.

These give AI agents a fast, structured overview of your entire site without crawling.
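
As a concrete sketch, a minimal /llms.txt following the llmstxt.org format might look like this (the site name, URLs, and descriptions are placeholders):

# Example Blog
> A blog about AI, software, and publishing.

## Posts
- [Making Your Site AI-Agent-Friendly](https://example.com/blog/ai-agent-friendly): How answer engines discover and cite content. Keywords: AI, answer engines, llms.txt
- [Another Post](https://example.com/blog/another-post): One-line description here. Keywords: example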

2. Serve Raw Markdown for Every Post

Generate a standalone .md file for each published page at a predictable URL pattern (e.g., /blog/{slug}.md). Include YAML front matter with title, date, description, language, keywords, canonical URL, and any translation links, followed by the raw markdown body.
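
The front matter for one post might look like this (field values are placeholders):

---
title: "Making Your Site AI-Agent-Friendly"
date: 2026-01-15
description: "How answer engines discover and cite content."
language: en
keywords: [AI, answer engines, llms.txt]
canonical: https://example.com/blog/ai-agent-friendly
translations:
  es: https://example.com/es/blog/ai-agent-friendly
---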

Configure your server to serve .md files as text/plain so they display inline in browsers and are trivially parseable by machines.
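
If you serve through nginx, for example, a location block like this does it (assuming nginx; other hosts have equivalent MIME-type overrides):

location ~* \.md$ {
    types { }
    default_type text/plain;
}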

Then add a discovery tag in the HTML <head> of every post:

<link rel="alternate" type="text/markdown" href="/blog/{slug}.md" />

This lets any agent that reads your HTML discover the machine-readable version without knowing your URL convention.

3. Add Schema.org JSON-LD

BlogPosting on Every Post Page

Embed a BlogPosting schema in <head> with headline, date published, URL, description, image, and keywords. The author should be a Person with name, url, and sameAs links to your social profiles (LinkedIn, GitHub, Bluesky, etc.). The publisher should be an Organization with your site name.

This helps answer engines attribute content to you and build knowledge graph connections.
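
A trimmed example of what that script block might contain (names, URLs, and dates are placeholders):

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "BlogPosting",
  "headline": "Making Your Site AI-Agent-Friendly",
  "datePublished": "2026-01-15",
  "url": "https://example.com/blog/ai-agent-friendly",
  "description": "How answer engines discover and cite content.",
  "image": "https://example.com/images/cover.png",
  "keywords": "AI, answer engines, llms.txt",
  "author": {
    "@type": "Person",
    "name": "Jane Doe",
    "url": "https://example.com",
    "sameAs": [
      "https://www.linkedin.com/in/janedoe",
      "https://github.com/janedoe"
    ]
  },
  "publisher": {
    "@type": "Organization",
    "name": "Example Blog"
  }
}
</script>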

Person on Your Homepage

Add a standalone Person schema to your root page with name, URL, job title, and sameAs links to all your web properties. This gives answer engines a canonical identity node to anchor you across the web.
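
For example (again with placeholder values):

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Person",
  "name": "Jane Doe",
  "url": "https://example.com",
  "jobTitle": "Software Engineer",
  "sameAs": [
    "https://www.linkedin.com/in/janedoe",
    "https://github.com/janedoe",
    "https://bsky.app/profile/janedoe.example.com"
  ]
}
</script>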

Julia Solorzano's post also recommends a CollectionPage schema with CreativeWork items for portfolio pages β€” worth considering if you showcase projects.

4. Provide Structured Feeds

JSON Feed

Serve a JSON Feed v1.1 endpoint with full content in both HTML (content_html) and plain text (content_text), plus summary, image, tags, and attachments linking to each post's .md file.
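
A single-item sketch of the feed shape (field names per the JSON Feed v1.1 spec; values are placeholders):

{
  "version": "https://jsonfeed.org/version/1.1",
  "title": "Example Blog",
  "home_page_url": "https://example.com",
  "feed_url": "https://example.com/feed.json",
  "items": [
    {
      "id": "https://example.com/blog/ai-agent-friendly",
      "url": "https://example.com/blog/ai-agent-friendly",
      "title": "Making Your Site AI-Agent-Friendly",
      "date_published": "2026-01-15T00:00:00Z",
      "summary": "How answer engines discover and cite content.",
      "image": "https://example.com/images/cover.png",
      "tags": ["AI", "answer engines"],
      "content_html": "<p>...</p>",
      "content_text": "...",
      "attachments": [
        { "url": "https://example.com/blog/ai-agent-friendly.md", "mime_type": "text/markdown" }
      ]
    }
  ]
}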

Add auto-discovery in <head>:

<link rel="alternate" type="application/feed+json" title="My Blog (JSON Feed)" href="/feed.json" />

Post Index

Serve a posts.json file containing all posts with metadata: slug, title, date, description, category, image, keywords, markdown URL, and translation links. This gives agents a structured catalog of everything without parsing feeds.
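
One entry might look like this (the exact field names are yours to choose; these mirror the metadata above):

[
  {
    "slug": "ai-agent-friendly",
    "title": "Making Your Site AI-Agent-Friendly",
    "date": "2026-01-15",
    "description": "How answer engines discover and cite content.",
    "category": "AI",
    "image": "https://example.com/images/cover.png",
    "keywords": ["AI", "answer engines", "llms.txt"],
    "markdown_url": "https://example.com/blog/ai-agent-friendly.md",
    "translations": { "es": "https://example.com/es/blog/ai-agent-friendly" }
  }
]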

Atom/RSS Feed

A standard Atom or RSS feed with full HTML content. Many agents still look for this first.
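
A minimal Atom skeleton, with placeholder URLs and dates:

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Blog</title>
  <id>https://example.com/</id>
  <updated>2026-01-15T00:00:00Z</updated>
  <link href="https://example.com/"/>
  <link rel="self" href="https://example.com/atom.xml"/>
  <entry>
    <title>Making Your Site AI-Agent-Friendly</title>
    <id>https://example.com/blog/ai-agent-friendly</id>
    <updated>2026-01-15T00:00:00Z</updated>
    <link href="https://example.com/blog/ai-agent-friendly"/>
    <content type="html">&lt;p&gt;Full post HTML here.&lt;/p&gt;</content>
  </entry>
</feed>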

5. Welcome AI Crawlers in robots.txt

Explicitly allow the major AI crawlers by user-agent:

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GoogleOther
Allow: /

User-agent: Amazonbot
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

User-agent: Bytespider
Allow: /

User-agent: cohere-ai
Allow: /

Include your sitemap URL at the bottom. The list above covers the major ones as of early 2026; check periodically for new crawlers.

6. Make Your Sitemap Discoverable Two Ways

Don't rely on robots.txt alone. Add a link tag in <head> on every page:

<link rel="sitemap" type="application/xml" href="/sitemap.xml" />

This way crawlers that read page HTML (not just robots.txt) can find your sitemap too.

7. Write Real Meta Descriptions

Every page needs a unique <meta name="description"> in <head>. This is the sentence answer engines use to decide whether your page is worth citing. Don't leave it empty, and don't use the same generic description across multiple pages.

If your posts have keywords, add <meta name="keywords"> too and surface those keywords in your Schema.org JSON-LD, feeds, llms.txt, and .md front matter. Consistency across all these surfaces reinforces your topic signals. Keep descriptions under 160 characters and make them specific.
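
For example:

<meta name="description" content="How to make your blog discoverable, parseable, and citable by AI answer engines." />
<meta name="keywords" content="AI, answer engines, llms.txt, SEO" />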

8. Translate Your Posts and Add hreflang

Answer engines prefer citing content that already exists in the queried language over translating English on the fly. If you can translate your posts, even via AI translation with a human review pass, do it. Each language multiplies your surface area in language-specific queries.

Make sure your translated pages get the same treatment as your English ones: their own .md files at predictable URLs (e.g., /{lang}/{slug}.md) and inclusion in your feeds and llms.txt. The translation shouldn't be a second-class citizen; it should be just as discoverable and machine-readable as the original.

Then add <link rel="alternate" hreflang="{lang}"> tags in <head> for each translation so AI agents can discover all language versions from any single page.
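
For a post with Spanish and French translations, that might look like this (placeholder URLs):

<link rel="alternate" hreflang="en" href="https://example.com/blog/ai-agent-friendly" />
<link rel="alternate" hreflang="es" href="https://example.com/es/blog/ai-agent-friendly" />
<link rel="alternate" hreflang="fr" href="https://example.com/fr/blog/ai-agent-friendly" />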

Quick Reference

| Feature | What to Add | Why |
| --- | --- | --- |
| llms.txt | Plain-text site overview at domain root | Fast agent discovery without crawling |
| Per-post .md files | Raw markdown at predictable URLs | Machine-readable content |
| Markdown link tag | `<link rel="alternate" type="text/markdown">` | HTML discovery of .md files |
| BlogPosting JSON-LD | Schema in `<head>` on post pages | Structured authorship + content metadata |
| Person JSON-LD | Schema on homepage | Canonical identity node |
| JSON Feed | Feed endpoint with full content | Structured feed for agents |
| Post index | posts.json with all post metadata | Full catalog without feed parsing |
| Atom/RSS feed | Standard XML feed | Legacy agent compatibility |
| robots.txt | Explicit Allow for AI crawlers | Clear permission signal |
| Sitemap link | `<link rel="sitemap">` in `<head>` | Dual-path sitemap discovery |
| Meta descriptions | Unique per page, under 160 chars | Citation decision input |
| Meta keywords | In HTML, JSON-LD, feeds, llms.txt, .md | Consistent topic signals |
| Translations + hreflang | Full posts in target languages + `<link rel="alternate" hreflang>` | Surface area in non-English queries |

That's it. No framework required, no third-party service needed. Most of these are static files and HTML tags that any static site generator can produce. The hardest part is the initial setup; once it's templated, every new post gets all of this automatically.

Prompt for Your AI Agent

If you use an AI coding agent, paste the following prompt to have it audit your site and implement these features for you.

Audit my website for AI agent discoverability and implement the following features. Analyze my existing setup first and skip anything already in place.

1. **llms.txt + llms-full.txt**: Generate two plain-text files at the domain root following the llms.txt standard (https://llmstxt.org/). llms.txt: site description + list of pages with title, URL, description, keywords. llms-full.txt: same but with full page content inline, with a 100KB size cap (fall back to title + description when exceeded). Both should auto-generate from existing content.

2. **Per-post markdown files**: For every published page, generate a .md file at a predictable URL (e.g., /blog/{slug}.md). Include YAML front matter (title, date, description, language, keywords, canonical URL, markdown URL, translation links) followed by the raw markdown body. Configure the server to serve .md as text/plain. Add `<link rel="alternate" type="text/markdown" href="...">` in <head> on every post page.

3. **Schema.org JSON-LD**: Add BlogPosting schema in <head> on every post with: headline, datePublished, url, description, image, keywords. Author as Person with name, url, sameAs (social profile URLs). Publisher as Organization. Add a standalone Person schema on the homepage with name, url, jobTitle, sameAs links to all web properties.

4. **Structured feeds**: JSON Feed v1.1 endpoint with content_html, content_text, summary, image, tags, and attachments linking to .md files. Add feed auto-discovery in <head>. A posts.json index with all posts and metadata (slug, title, date, description, category, image, keywords, markdown URL, translation links). Atom or RSS feed with full HTML content.

5. **robots.txt**: Explicitly Allow / for these user-agents: GPTBot, ClaudeBot, PerplexityBot, GoogleOther, Amazonbot, Applebot-Extended, Meta-ExternalAgent, Bytespider, cohere-ai. Include the Sitemap URL.

6. **Sitemap discovery**: Add `<link rel="sitemap" type="application/xml" href="/sitemap.xml">` in <head> on every page (in addition to the robots.txt Sitemap directive).

7. **Meta descriptions + keywords**: Ensure every page has a unique <meta name="description"> (under 160 chars). Add <meta name="keywords"> where keywords exist. Surface keywords consistently in JSON-LD, feeds, llms.txt, and .md front matter.

8. **Translations + hreflang**: If the site has translations, ensure they get equal treatment: their own .md files, hreflang cross-references, inclusion in feeds and llms.txt. Add `<link rel="alternate" hreflang="{lang}">` tags in <head> for each translation.

Present a plan showing what already exists vs. what needs to be added before making changes.
