In November 2012, sitting across from Michael Cyger on the DomainSherpa podcast, Matt Mazur described the most successful piece of content he had ever written. He had pulled the .com zone file, looked at every modifier that the Lean Domain Search algorithm paired with a user's keyword, and tallied which prefixes and suffixes appeared most often.
His verbatim finding, from the transcript:
> "It turns out that 'my' is the most popular prefix and 'online' is the most popular suffix."
>
> — Matt Mazur, DomainSherpa interview, November 2012 (transcript line 193)
Are they still? That is the question this report exists to answer.
The methodology, the corpus, and the compute have all changed since 2012. Mazur worked off a zone file that contained roughly 100 million .com registrations. We worked off the May 2026 CZDS pull, which contains approximately [PLACEHOLDER — exact count from `data/derived/sld/*.parquet` row count]. He counted prefixes and suffixes by hand-rolling SQL against a copy of the zone he had loaded onto his own machine. We ran the same shape of query, but across the full corpus, with zero sampling, using DuckDB on a parquet projection of the SLD column.
The instinct, if you read enough of these reports, is to compress everything into a single headline number. We have resisted that. The story here is the movement — what rose, what fell, what didn't exist in 2012, what looks identical fourteen years later. Each section below is one slice of that movement.
## How big .com actually is, 2012 vs 2026
In 2012, Mazur's analysis sat on top of a .com zone of roughly 100 million names. In 2026, the zone contains approximately [PLACEHOLDER — exact figure from `data/derived/sld/*.parquet` row count] second-level domains.
Spread across fourteen years, that growth is [PLACEHOLDER — derived from (2026_count - 2012_count) / 2012_count] in percentage terms and [PLACEHOLDER — compound annual growth rate calculated from the same two numbers] compounded annually. The Internet did not double. It also did not stand still.
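Both derived figures are plain arithmetic on the two zone counts. A minimal sketch in JavaScript, assuming nothing beyond the two totals and the fourteen-year span; the function name `growth` and the sample inputs are illustrative and are not figures from the zone:

```javascript
// Total growth and compound annual growth rate between two counts.
function growth(startCount, endCount, years) {
  // (2026_count - 2012_count) / 2012_count
  const total = (endCount - startCount) / startCount;
  // CAGR: the constant yearly rate that turns startCount into endCount.
  const cagr = Math.pow(endCount / startCount, 1 / years) - 1;
  return { total, cagr };
}

// Hypothetical inputs for illustration only.
const example = growth(100, 150, 14);
console.log((example.total * 100).toFixed(1) + "% total");
console.log((example.cagr * 100).toFixed(2) + "% per year");
```

Reporting both numbers matters because a large total spread over fourteen years can still be a modest annual rate.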
What the headline number hides is composition. The .com namespace in 2012 was disproportionately populated by short, dictionary-word, English-language names registered between 1995 and 2008. The .com namespace in 2026 is dominated by long, compound, often non-English names registered between 2015 and 2024. That is the result of more than a decade of bulk registration, defensive parking, and Western and non-Western language markets converging on a single global TLD. The composition shift matters more than the count shift.
## The 2026 prefix leaderboard
Mazur's 2012 top ten prefixes, for reference, ran: `my+`, `the+`, `web+`, `go+`, `super+`, `free+`, `green+`, `net+`, `new+`, `pro+`. Personal possession, definite article, web-as-substance, action verbs, hype amplifiers.
The 2026 top ten:
| Rank | Prefix | 2026 count | 2012 rank | Movement |
|------|--------|------------|-----------|----------|
| 1 | [PLACEHOLDER] | [PLACEHOLDER — from `data/derived/prefix_counts/*.parquet` ORDER BY count DESC LIMIT 1] | [PLACEHOLDER — cross-reference 2012 table] | [PLACEHOLDER] |
| 2 | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] |
| 3 | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] |
| 4 | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] |
| 5 | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] |
| 6 | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] |
| 7 | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] |
| 8 | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] |
| 9 | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] |
| 10 | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] |
The full 2026 top-100 prefix list is published alongside this article at `/research/state-of-com-2026/prefixes`. Each entry links to its `/trends/[term]` page where the count is broken down by SLD length, character set, and (where derivable) age cohort.
A prefix's rank, taken alone, is a vanity metric. The signal is what the rank tells you about the registrants. `my+` ranked first in 2012 because individuals registered names for themselves. If `my+` has fallen, it is because individual registrations have been swamped by organisational ones. If `my+` has held, individuals are still showing up in roughly the same proportion. Both findings would be worth knowing.
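To make the counting concrete, here is a minimal sketch of an affix tally in plain JavaScript. It is not the DuckDB query the pipeline runs; `tallyAffixes`, `leaderboard`, and the sample names are hypothetical. It also assumes plain string matching, so a name like `mystery` would count toward `my+`; whether the real derive segments keywords more carefully is not stated in this article.

```javascript
// Count how many SLDs start or end with each candidate modifier.
function tallyAffixes(slds, modifiers) {
  const prefix = new Map(modifiers.map((m) => [m, 0]));
  const suffix = new Map(modifiers.map((m) => [m, 0]));
  for (const sld of slds) {
    for (const m of modifiers) {
      // Require a non-empty keyword next to the modifier,
      // so the bare name "my" does not count as my+<keyword>.
      if (sld.length <= m.length) continue;
      if (sld.startsWith(m)) prefix.set(m, prefix.get(m) + 1);
      if (sld.endsWith(m)) suffix.set(m, suffix.get(m) + 1);
    }
  }
  return { prefix, suffix };
}

// Top-n entries by count, ties broken alphabetically for stable output.
function leaderboard(counts, n) {
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : 1))
    .slice(0, n);
}

// Hypothetical sample, not zone data.
const sample = ["myshop", "mysite", "shoponline", "my", "online"];
const tallies = tallyAffixes(sample, ["my", "online", "shop"]);
console.log(leaderboard(tallies.prefix, 3));
console.log(leaderboard(tallies.suffix, 3));
```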
## The 2026 suffix leaderboard
Mazur's 2012 top ten suffixes ran: `+online`, `+web`, `+media`, `+world`, `+net`, `+group`, `+blog`, `+shop`, `+book`, `+store`. Web-as-place, media-as-noun, commerce.
The 2026 top ten:
| Rank | Suffix | 2026 count | 2012 rank | Movement |
|------|--------|------------|-----------|----------|
| 1 | [PLACEHOLDER] | [PLACEHOLDER — from `data/derived/suffix_counts/*.parquet` ORDER BY count DESC LIMIT 1] | [PLACEHOLDER — cross-reference 2012 table] | [PLACEHOLDER] |
| 2 | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] |
| 3 | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] |
| 4 | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] |
| 5 | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] |
| 6 | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] |
| 7 | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] |
| 8 | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] |
| 9 | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] |
| 10 | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] |
The full top-100 suffix list is at `/research/state-of-com-2026/suffixes`.
Notice what is not necessarily here. `+app` ranked twenty-third in 2012, when "app" was still a relatively new word. `+ai` did not appear at any rank that mattered in 2012, because in 2012 the dominant association of those two letters was the country code for Anguilla, not artificial intelligence. The presence or absence of these suffixes in the 2026 top ten will tell us whether a category became a default naming convention or remained a niche.
## What rose, what fell
The interesting cells in a 2012-vs-2026 prefix and suffix table are not the ones at the top. They are the ones that moved.
A movement table — the prefixes and suffixes that rose at least twenty positions, and the ones that fell at least twenty positions — is published at `/research/state-of-com-2026/movers`. The headline movers:
- Biggest riser, prefix: [PLACEHOLDER — from prefix_counts diff] (from rank [PLACEHOLDER] to rank [PLACEHOLDER])
- Biggest riser, suffix: [PLACEHOLDER — from suffix_counts diff] (from rank [PLACEHOLDER] to rank [PLACEHOLDER])
- Biggest faller, prefix: [PLACEHOLDER — from prefix_counts diff] (from rank [PLACEHOLDER] to rank [PLACEHOLDER])
- Biggest faller, suffix: [PLACEHOLDER — from suffix_counts diff] (from rank [PLACEHOLDER] to rank [PLACEHOLDER])
- New entries to the top 100 that did not appear at any rank in 2012: [PLACEHOLDER — list]
- 2012 top-100 entries that fell out of the top 1,000 by 2026: [PLACEHOLDER — list]
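The movement table reduces to a rank diff between two leaderboards. A minimal sketch, assuming each year's ranks are available as a `Map` from term to rank; `rankMovement` and the placeholder terms (`alpha`, `beta`, and so on) are hypothetical and carry no data from either year:

```javascript
// Compare two rank tables and report terms that moved at least
// `threshold` positions, plus terms with no 2012 rank at all.
function rankMovement(ranks2012, ranks2026, threshold = 20) {
  const risers = [];
  const fallers = [];
  const newEntries = [];
  for (const [term, now] of ranks2026) {
    const then = ranks2012.get(term);
    if (then === undefined) {
      newEntries.push(term);
      continue;
    }
    const delta = then - now; // positive means the term climbed
    if (delta >= threshold) risers.push({ term, then, now, delta });
    else if (delta <= -threshold) fallers.push({ term, then, now, delta });
  }
  risers.sort((a, b) => b.delta - a.delta); // biggest riser first
  fallers.sort((a, b) => a.delta - b.delta); // biggest faller first
  return { risers, fallers, newEntries };
}

// Placeholder terms and ranks, for illustration only.
const before = new Map([["alpha", 1], ["beta", 50], ["gamma", 5]]);
const after = new Map([["alpha", 30], ["beta", 10], ["gamma", 6], ["delta", 2]]);
console.log(rankMovement(before, after));
```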
Read these as a vocabulary diary. A naming convention that was native to one decade gets carried into the next or it does not. The convention that does not get carried over is more interesting than the one that does, because it tells you which assumptions about the Internet stopped being true.
## Length distribution
The average SLD in Mazur's 2012 corpus was [PLACEHOLDER — historical 2012 figure if reconstructible; otherwise note as unavailable] characters. In 2026, the average is [PLACEHOLDER — from `data/derived/length_dist/*.parquet`, weighted mean].
The shape of the distribution matters more than the mean. A bimodal distribution — a cluster of 4-6 character names and a separate cluster of 12-20 character names — would tell you that .com has split into two markets: the legacy short-name market and the modern long-name market. A single broad hump would tell you something else.
| Length | 2026 count | Share of corpus |
|--------|------------|-----------------|
| 1 | [PLACEHOLDER] | [PLACEHOLDER] |
| 2 | [PLACEHOLDER] | [PLACEHOLDER] |
| 3 | [PLACEHOLDER] | [PLACEHOLDER] |
| 4 | [PLACEHOLDER] | [PLACEHOLDER] |
| 5 | [PLACEHOLDER] | [PLACEHOLDER] |
| 6 | [PLACEHOLDER] | [PLACEHOLDER] |
| 7 | [PLACEHOLDER] | [PLACEHOLDER] |
| 8 | [PLACEHOLDER] | [PLACEHOLDER] |
| 9-12 | [PLACEHOLDER] | [PLACEHOLDER] |
| 13-20 | [PLACEHOLDER] | [PLACEHOLDER] |
| 21+ | [PLACEHOLDER] | [PLACEHOLDER] |
The full histogram, one row per length value from 1 to 63, is at `/research/state-of-com-2026/length-distribution`.
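The bucketed table and the mean can both come out of a single pass over the names. A minimal sketch, assuming lengths are measured on the ASCII form of the SLD as it appears in the zone; `BUCKETS` mirrors the rows of the table above, while `lengthDistribution` and the sample names are hypothetical:

```javascript
// Buckets matching the summary table: 1..8 individually, then ranges.
const BUCKETS = [
  ...Array.from({ length: 8 }, (_, i) => ({ label: String(i + 1), min: i + 1, max: i + 1 })),
  { label: "9-12", min: 9, max: 12 },
  { label: "13-20", min: 13, max: 20 },
  { label: "21+", min: 21, max: 63 }, // 63 is the DNS label length limit
];

function lengthDistribution(slds) {
  const counts = BUCKETS.map(() => 0);
  let total = 0;
  let lengthSum = 0;
  for (const sld of slds) {
    const index = BUCKETS.findIndex((b) => sld.length >= b.min && sld.length <= b.max);
    if (index === -1) continue; // empty or over-long strings are not valid SLDs
    counts[index] += 1;
    total += 1;
    lengthSum += sld.length;
  }
  return {
    mean: total === 0 ? 0 : lengthSum / total,
    rows: BUCKETS.map((b, i) => ({
      label: b.label,
      count: counts[i],
      share: total === 0 ? 0 : counts[i] / total,
    })),
  };
}

// Hypothetical sample, not zone data.
console.log(lengthDistribution(["a", "abcd", "abcdefghij", "x".repeat(25)]));
```

A bimodal shape would show up here as two separated runs of high `share` values with a dip between them.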
A name's length is one of the strongest priors on its readability. Processing fluency research — the same body of work that explains why "Coca-Cola" is easier to recall than "The Coca-Cola Company" — predicts that names beyond ten characters carry a measurable recall penalty. The distribution tells us how many .com registrants accepted that penalty in exchange for getting a name they could actually have.
## AI in domains
The single highest-stakes question this report can answer in 2026 is: how much of .com is now an AI domain?
The answer requires a careful definition. "AI" appears as a substring in thousands of English words that have nothing to do with artificial intelligence — `main`, `rain`, `claim`, `train`, `chain`, `brain`, `said`, `paid`, `fair`, `wait`, `painting`. A naive substring count would inflate the AI domain population by every name containing those tokens, which would be misleading at best and dishonest at worst.
We separate three classes:
1. `ai` as a morpheme — the SLD begins with `ai`, ends with `ai`, or contains `ai` flanked by clear word boundaries (hyphens, camelCase, or known word splits). Examples: `getai.com`, `ai-coach.com`, `tryai.com`, `myaitutor.com`. Count: [PLACEHOLDER — from `data/derived/trends/.parquet` WHERE term = 'ai' AND class = 'morpheme'].
2. `ai` adjacent to a known AI-adjacent token — `gpt`, `llm`, `agent`, `bot`, `chat`, `prompt`, `model`, `neural`. Counted only when both tokens appear in the same SLD. Count: [PLACEHOLDER — from `data/derived/trends/.parquet` WHERE term IN ('gpt','llm','agent', ...) AND co_occurrence_with_ai = true].
3. `ai` ambiguous — appears in the SLD but is most likely part of a non-AI English word. We exclude these from the AI count and publish them separately so the methodology is fully auditable. Count: [PLACEHOLDER — same parquet, class = 'ambiguous'].
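The three classes can be read as a small decision procedure. The sketch below is one possible reading, not the classifier in `derive_trends.js`: `classifyAi`, `hasWordSplit`, and the tiny `KNOWN_WORDS` set are hypothetical, and each SLD is assigned to exactly one class (morpheme first) so that class 1 plus class 2 never double-counts a name.

```javascript
// Tokens taken from the class 2 definition above.
const AI_ADJACENT = ["gpt", "llm", "agent", "bot", "chat", "prompt", "model", "neural"];
// Hypothetical stand-in for the dictionary behind "known word splits".
const KNOWN_WORDS = new Set(["my", "tutor", "get", "try", "coach"]);

// True when the SLD reads as <known word> + "ai" + <known word>.
function hasWordSplit(sld) {
  for (let i = sld.indexOf("ai"); i !== -1; i = sld.indexOf("ai", i + 1)) {
    if (KNOWN_WORDS.has(sld.slice(0, i)) && KNOWN_WORDS.has(sld.slice(i + 2))) return true;
  }
  return false;
}

function classifyAi(sld) {
  const s = sld.toLowerCase();
  if (!s.includes("ai")) return "none";
  // Class 1. Applied literally, the start/end test also admits names such
  // as "airline", so a production rule would presumably check a word list.
  const isMorpheme =
    s.startsWith("ai") || s.endsWith("ai") || s.split("-").includes("ai") || hasWordSplit(s);
  if (isMorpheme) return "morpheme";
  // Class 2: "ai" co-occurs with an AI-adjacent token in the same SLD.
  if (AI_ADJACENT.some((token) => s.includes(token))) return "adjacent";
  // Class 3: everything else that merely contains the letters.
  return "ambiguous";
}

for (const name of ["getai", "ai-coach", "myaitutor", "chaingpt", "painting"]) {
  console.log(name, classifyAi(name));
}
```

Zone files carry names in a single case, so this sketch has no camelCase test; boundaries are inferred from hyphens and the word list only.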
The headline AI .com count, using the conservative definition (class 1 + class 2): [PLACEHOLDER — sum of the above].
For context, the equivalent count run against the 2012 zone file using the same logic returns [PLACEHOLDER — historical 2012 figure if reconstructible from archived zone snapshots; otherwise note as unavailable]. The growth multiple is [PLACEHOLDER — ratio].
That number, whatever it lands at, is the single most-quotable statistic this report produces. It is also the number most likely to be challenged on Hacker News by someone pointing out that `main` and `rain` exist. The class breakdown above is the answer to that challenge — every disputed SLD is in class 3, and class 3 is excluded from the headline.
## Character frequency
Letters are not used equally in domain names. They are not used equally in English text either, but the distribution in domain names diverges from English text in instructive ways. English text leans heavily on `e`, `t`, `a`, `o`. Domain names lean differently, because brand names are not normal text — they are selected for distinctiveness, available alphabet positions, and (often) for the visual symmetry of the letterforms.
The character frequency across the 2026 .com SLD corpus, ranked from most to least common:
| Rank | Character | Count | Share |
|------|-----------|-------|-------|
| 1 | [PLACEHOLDER] | [PLACEHOLDER — from `data/derived/length_dist/` extended with per-character tally, or a new `data/derived/char_freq/*.parquet`] | [PLACEHOLDER] |
| 2 | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] |
| 3 | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] |
| 4 | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] |
| 5 | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] |
The full 36-row table (26 letters plus 10 digits) is at `/research/state-of-com-2026/character-frequency`.
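A per-character tally is a single pass over every SLD. A minimal sketch, with `charFrequency` and the sample names as hypothetical stand-ins for the derive that has not been written yet. Hyphens are legal inside an SLD, so this version keeps them as their own row rather than dropping them:

```javascript
// Count every character across all SLDs and rank by frequency.
function charFrequency(slds) {
  const counts = new Map();
  let total = 0;
  for (const sld of slds) {
    for (const ch of sld) {
      counts.set(ch, (counts.get(ch) || 0) + 1);
      total += 1;
    }
  }
  return [...counts]
    .map(([char, count]) => ({ char, count, share: count / total }))
    // Most common first; ties broken by character for stable output.
    .sort((a, b) => b.count - a.count || (a.char < b.char ? -1 : 1));
}

// Hypothetical sample, not zone data.
console.log(charFrequency(["aab", "ab-1"]));
```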
The under-represented letters — `j`, `q`, `x`, `z` — are also the letters that produce the highest-distinctiveness brand names when used well. `Xerox`, `Zappos`, `Jet`, `Qualtrics`. Rarity is signal. A character frequency table is, indirectly, a guide to which letters carry premium brandability simply by virtue of being scarce.
## The 4-letter domain reality
There are exactly 456,976 possible 4-letter .com domains using the ASCII letters `a` through `z`. Extend the alphabet to include the digits 0-9, and the number becomes 1,679,616.
Of the 456,976 letters-only names, the count actually available for registration as of the May 2026 zone pull is [PLACEHOLDER — from `data/derived/available_short/by_length=4/*.parquet` row count]. That is [PLACEHOLDER — share as percentage] of the all-letter set.
| Pattern | Total combinations | Registered | Available |
|---------|--------------------|------------|-----------|
| LLLL (letters only) | 456,976 | [PLACEHOLDER] | [PLACEHOLDER] |
| LLLN (3 letters, ending number) | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] |
| NLLL (starting number) | [PLACEHOLDER] | [PLACEHOLDER] | [PLACEHOLDER] |
| All 4-char alphanumeric | 1,679,616 | [PLACEHOLDER] | [PLACEHOLDER] |
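The "Total combinations" column is pure counting: multiply the number of choices in each position. A minimal sketch, where `patternCount` and the `A` (letter or digit) slot code are hypothetical conventions introduced here:

```javascript
const LETTERS = 26; // a-z
const DIGITS = 10; // 0-9

// Number of distinct strings matching a pattern such as "LLLN".
// L = letter, N = digit, A = either (hypothetical slot code).
function patternCount(pattern) {
  let total = 1;
  for (const slot of pattern) {
    if (slot === "L") total *= LETTERS;
    else if (slot === "N") total *= DIGITS;
    else if (slot === "A") total *= LETTERS + DIGITS;
    else throw new Error("Unknown slot: " + slot);
  }
  return total;
}

console.log(patternCount("LLLL")); // 456976, i.e. 26^4
console.log(patternCount("AAAA")); // 1679616, i.e. 36^4
console.log(patternCount("LLLN")); // 175760, i.e. 26^3 * 10
```

Registered and available counts cannot be computed this way; they depend on the zone itself.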
A 4-letter .com is, by widely accepted convention in the aftermarket, the floor of premium domain value. Below 4 letters and the entire market is institutionally held by a few hundred owners. Above 6 letters and the supply is effectively unlimited. The 4-letter band is the negotiable middle. The available count tells you exactly how much room there is left in that middle.
The full enumeration is published at `/research/state-of-com-2026/four-letter-availability` with a downloadable parquet of every available 4-letter combination, scored for phonetic structure (CVCV, CVVC, VCVC, etc.) so a reader can filter to the names that are actually pronounceable.
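Phonetic structure here means the consonant/vowel skeleton of a name. A minimal sketch of how such a label could be derived, assuming the five vowels `a e i o u` and treating `y` as a consonant; `cvPattern`, `matchesPattern`, and the default allow-list are hypothetical and are not the published scoring:

```javascript
const VOWELS = new Set("aeiou"); // assumption: y counts as a consonant

// Map each letter to C or V; anything outside a-z becomes "?".
function cvPattern(name) {
  return [...name.toLowerCase()]
    .map((ch) => (/[a-z]/.test(ch) ? (VOWELS.has(ch) ? "V" : "C") : "?"))
    .join("");
}

// Keep only names whose skeleton is on an allow-list of shapes.
function matchesPattern(name, allowed = ["CVCV", "CVVC", "VCVC"]) {
  return allowed.includes(cvPattern(name));
}

// Made-up strings for illustration; availability is not implied.
for (const name of ["zola", "moon", "emit", "xkcd"]) {
  console.log(name, cvPattern(name), matchesPattern(name));
}
```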
## Mazur's unrealised vision
Mazur did not stop at the 2012 prefix/suffix table. He first told Cyger how the zone file made the analysis possible at all:
> "If I could talk about how I did the domain search for Domain Pigeon — I mentioned I did it through WHOIS. Again, WHOIS is slow though. For Lean Domain Search, I kind of took this other route where I look at the .com zone file, which is this massive file that lists all the registered domain names. I've come up with ways to query that to determine whether something is available or not. That's why it's much quicker than doing the standard WHOIS search."
>
> — Matt Mazur, DomainSherpa interview, November 2012 (transcript line 225)
And then the ambition, which he never built:
> "But through doing that, there's a lot of really interesting results. Like I mentioned before, you can figure out the best prefixes and suffixes, and you can also figure out what the most popular keywords are looking at all the registered domain names. There's a lot of interesting analysis I can do, and I'm thinking maybe there's a way I can automate that. For example, I think it might be interesting month to month to see what topics people are registering domain names for — what's rising. So maybe with the election coming up, more people are registering Obama or Romney-related domain names. I want to come up with a way to automate that process so I can spit out these reports showing over time trends in domain name registrations. Maybe I can automate that and that can be kind of my automatic content strategy which will get indexed and people will share and things like that."
>
> — Matt Mazur, DomainSherpa interview, November 2012 (transcript line 229)
Here it is, fourteen years later.
The reports we publish from this point forward — the monthly trend updates, the rising-and-falling tables, the AI-in-domains tracker, the prefix and suffix movement charts — are the automation Mazur described. He pointed at the system and named it. We built it. The order of those two acts matters.
## Methodology
The corpus is the May 2026 .com zone file, pulled under our CZDS approval. Every SLD in the zone is in the corpus. No sampling. No top-N truncation during derive. Display caps on individual results pages are a templating concern; the underlying parquet stores every row.
The pipeline:
1. The raw zone (`com.txt`, approximately 25 GB) is streamed once through `scripts/build_sld_parquet.js` (Layer 2, this week per `2026_05_18_roadmap_to_launch.md`) to produce `data/derived/sld/part-*.parquet` — the canonical SLD projection in snappy-compressed parquet, one row per SLD.
2. From that parquet, downstream derive scripts produce: prefix counts (`derive_prefix_counts.js`), suffix counts (`derive_suffix_counts.js`), length distribution (`derive_length_dist.js`), trend counts including the AI-morpheme classification (`derive_trends.js`), and the 4-letter availability set (`derive_available_short.js` joined against the bloom filter).
3. Each derive runs in DuckDB against the SLD parquet. None of them reads another derive's output (DR4 — no derived-on-derived cascades).
4. Numbers in this article come from a single point-in-time pull. The pull date is stamped at the bottom of the live results pages. Subsequent pulls produce updated numbers, which are diffed against this baseline.
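The first step, turning zone records into one row per SLD, can be sketched as a streaming filter. This is not the contents of `scripts/build_sld_parquet.js`. It assumes whitespace-separated records whose first field is a fully qualified owner name such as `example.com.`, and that all records for one name are contiguous, so comparing with the previous name is enough to deduplicate. `extractSlds` and the sample records are hypothetical.

```javascript
// Yield each second-level label once from an iterable of zone lines.
function* extractSlds(lines) {
  const suffix = ".com.";
  let previous = null;
  for (const line of lines) {
    if (!line || line.startsWith(";")) continue; // blank or comment
    const owner = line.split(/\s+/, 1)[0].toLowerCase();
    if (!owner.endsWith(suffix)) continue; // apex "com." and foreign names
    const sld = owner.slice(0, -suffix.length);
    if (sld === "" || sld.includes(".")) continue; // deeper names, e.g. glue hosts
    if (sld !== previous) {
      previous = sld;
      yield sld;
    }
  }
}

// Hypothetical records in the assumed format.
const sampleZone = [
  "; sample",
  "com. 900 in soa a.example.net. admin.example.net. 1 1800 900 604800 86400",
  "example.com. 172800 in ns ns1.example.net.",
  "example.com. 172800 in ns ns2.example.net.",
  "ns1.example.com. 172800 in a 192.0.2.1",
  "my-shop.com. 172800 in ns ns1.example.net.",
];
console.log([...extractSlds(sampleZone)]);
```

Because the generator holds only the previous name in memory, the same shape works when the lines come from a file stream instead of an array.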
The bloom filter that backs `/api/check` on the live site is built by `scripts/build_bloom_streaming.js` from the same raw zone. It is calibrated to a 1% false-positive rate, sized to fit Vercel's function bundle limit. The bloom is not the source of any number in this report — every count comes from the parquet — but it is the data structure that makes the same dataset queryable in microseconds from a browser.
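The trade-off between false-positive rate and file size follows from the standard Bloom filter sizing formulas: for n items and target rate p, the optimal bit count is m = -n ln(p) / (ln 2)^2 and the optimal number of hash functions is k = (m / n) ln 2. A minimal sketch, with `bloomParameters` and the item count as hypothetical values; the real build script may size the filter differently:

```javascript
// Optimal Bloom filter size for n items at false-positive rate p.
function bloomParameters(n, p) {
  const bits = Math.ceil((-n * Math.log(p)) / (Math.LN2 * Math.LN2));
  const hashes = Math.max(1, Math.round((bits / n) * Math.LN2));
  return { bits, bytes: Math.ceil(bits / 8), hashes };
}

// Hypothetical item count; the real n is the SLD row count.
const params = bloomParameters(1000000, 0.01);
console.log(params);
console.log((params.bits / 1000000).toFixed(2) + " bits per item");
```

At a 1% target the formula works out to a little under 10 bits per name regardless of n, which is what makes the bundle size predictable from the zone count alone.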
Two things this report explicitly does not do. It does not republish the raw zone (CZDS terms of service prohibit it; the transformation is the entire moat). And it does not expose what individual visitors searched for on `domainsleft.com` — search logging is off, the codebase is inspectable, and the analytical numbers above are derived from the zone, never from user behaviour. The anti-front-running stance is not a marketing claim. It is a property of the system.
## What this is, and what comes next
This report is a snapshot. The data refreshes on every zone-file pull, and the live pages at `/research/state-of-com-2026/*` will carry the updated numbers. The article you are reading is the May 2026 baseline; the diffs are the recurring content.
Mazur's prefix/suffix post was, by his own account, the most successful piece of content he ever published. It was picked up by tech press, cited by domain industry sites, and drove a measurable share of the traffic that took Lean Domain Search from a side project to a sale. The thing that made it work was not the brilliance of the writing. It was that the data was real, the corpus was complete, and the comparison was honest.
We have tried to clear the same bar.
---
<!-- FABRICATED-DATA-DISCLOSURE R21 four-part disclosure — surface in-thread per CLAUDE.md project-wide rule.
This article ships as the launch-day scaffold for the State-of-.com-2026 research piece. As of the 2026-05-18 publish date, the Layer 2 SLD parquet (`scripts/build_sld_parquet.js`, task #25) and the Layer 3 derives (prefix counts, suffix counts, length dist, trends, character freq, available-short — tasks #25-#28) have not yet been run. Every quantitative claim in this article is therefore tagged [PLACEHOLDER] and listed below.
Tag legend: all values below are [PLACEHOLDER] (unverified value, not safe to ship as-is). When the derives have run, each [PLACEHOLDER] is replaced with the real number sourced from the parquet path listed beside it.
WHY (applies to every entry below): The bloom is still building on the May 2026 com.txt; the SLD parquet derive is the Layer 2 task scheduled for this week per `2026_05_18_roadmap_to_launch.md`; the L3 derives are downstream of that. No real number is available at draft time. The scaffold ships first so the writing, structure, and Mazur citations can be reviewed before the data lands.
METHODS TRIED (applies to every entry below): This is the scaffold. The parquet derives are tracked as roadmap tasks #25 (SLD parquet), #26 (per-term trend pages incl. AI morpheme split), #27 (this article), and #28 (4-letter exhaustive availability). The bloom build is task #9 (in_progress). No data acquisition was attempted at draft time because the upstream derives don't exist yet.
NEXT STEPS (per-section):

1. Run `node scripts/build_bloom_streaming.js < com.txt > public/bloom-com.bin` to confirm the May 2026 zone pull (task #9).
2. Run `node scripts/build_sld_parquet.js` to produce `data/derived/sld/part-.parquet` (task #25).
3. Run `node scripts/derive_prefix_counts.js` against the SLD parquet to fill the prefix leaderboard (`data/derived/prefix_counts/.parquet`).
4. Run `node scripts/derive_suffix_counts.js` to fill the suffix leaderboard (`data/derived/suffix_counts/.parquet`).
5. Run `node scripts/derive_length_dist.js` to fill the length distribution table (`data/derived/length_dist/.parquet`).
6. Run `node scripts/derive_trends.js` with the AI / GPT / LLM / agent / bot / chat / prompt / model / neural seed list AND the morpheme classifier (class 1 / class 2 / class 3 split per AI-in-domains section) (`data/derived/trends/.parquet`).
7. Run a new `node scripts/derive_char_freq.js` (not yet written) for the character frequency section (`data/derived/char_freq/.parquet`).
8. Run `node scripts/derive_available_short.js` against the bloom for the 4-letter availability section (`data/derived/available_short/by_length=4/*.parquet`, task #28).
9. Open this file, search for `[PLACEHOLDER` (no closing bracket needed — every flag uses the open-bracket form), and replace each value with the real number from the corresponding parquet, citing the source path inline as the R21 audit trail.
10. When all placeholders are replaced, remove this comment block. If any [PLACEHOLDER] remains at publish time, the four-part disclosure stays attached to the published article and the article ships with the gaps visible per R21.
Historical-comparison placeholders (2012 figures): Mazur's 2012 .com total is conventionally cited at ~100M; the precise number is sourced from Verisign's Q4 2012 Domain Name Industry Brief and should be inserted with that citation before publish. The 2012 prefix/suffix ranks come from `Top 5000 Most Common Domain Prefix:Suffix List.md` in this repo and are not fabricated — they are read-from-source.
End of fabricated-data disclosure. -->