Statistics about Web page quality – like the frequency of certain validation error messages and the popularity of XHTML versus HTML doctypes – are pretty esoteric stuff, and likely not of great interest to most Webmasters. But they are of interest to the people who build the tools that build the Web, like the fine folks behind the W3C Validator [1].
If you're one of those people, or you're just interested in Web site quality, then you might be interested in this statistical overview of things that Nikita examines about each page she validates. (If you're unfamiliar with Nikita, you can see a sample of what Nikita reports about a site [2] or even take Nikita for a free test drive [3].)
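The data for each topic below are summarized as a table and, where appropriate, a pie chart. The data are an aggregate of the sites that Nikita has seen; no particular site is singled out. You can jump directly to the methods [4], validation messages [5], encodings [6], encoding sources [7], doctypes [8], media types [9], or conclusion [10].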
In March 2008 Philip wrote a small program to review the data generated by Nikita's most recent crawls. The program ignored duplicate crawls of the same site and all sites where Nikita saw fewer than 30 pages. This left 360 crawls of unique sites. From each of these, the program selected 30 pages at random for a corpus of 10800 pages. (10800 = 360 × 30.)
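The selection procedure described above can be sketched in a few lines of Python. This is only an illustration, not Philip's actual program: the function name `build_corpus` and the shape of the `crawls` input (a list of `(site, pages)` pairs, newest crawl first) are assumptions made for the example.

```python
import random

MIN_PAGES = 30     # sites with fewer pages than this are ignored
SAMPLE_SIZE = 30   # pages drawn at random from each remaining site


def build_corpus(crawls, seed=None):
    """Return a flat list of sampled pages.

    `crawls` is assumed (hypothetically) to be a list of (site, pages)
    pairs ordered from most recent to oldest, where `pages` is a list.
    """
    rng = random.Random(seed)
    seen_sites = set()
    corpus = []
    for site, pages in crawls:
        if site in seen_sites:       # duplicate crawl of the same site
            continue
        seen_sites.add(site)
        if len(pages) < MIN_PAGES:   # too few pages to sample from
            continue
        # Sample without replacement so no page is counted twice.
        corpus.extend(rng.sample(pages, SAMPLE_SIZE))
    return corpus
```

With 360 qualifying sites, this yields 360 × 30 = 10800 pages. Passing a fixed `seed` makes the sample reproducible, which is handy when re-running an analysis.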
There are a total of 540726 validation messages in the sample containing almost 5400 unique messages. Philip decided (somewhat arbitrarily) to show only the 25 most frequent messages in the table below. These 25 represent just over half (55%) of the sample.
Note that the message 'required attribute "ALT" not specified' appears twice, once with the attribute in uppercase and once in lowercase (an HTML/XHTML difference). If Philip were to combine these two entries, it would be the most common error message by a large margin. Even without combining them, it's still the most frequent.
| Instances | Portion of Total | Message |
|---|---|---|
| 41196 | 7.62% | |
| 40187 | 7.43% | |
| 37662 | 6.97% | |
| 21067 | 3.90% | |
| 20055 | 3.71% | |
| 16985 | 3.14% | |
| 16658 | 3.08% | |
| 15134 | 2.80% | |
| 9669 | 1.79% | |
| 8014 | 1.48% | |
| 7768 | 1.44% | |
| 7528 | 1.39% | |
| 7207 | 1.33% | |
| 6779 | 1.25% | |
| 5457 | 1.01% | |
| 5371 | 0.99% | |
| 4610 | 0.85% | |
| 4336 | 0.80% | |
| 4285 | 0.79% | |
| 3695 | 0.68% | |
| 3686 | 0.68% | |
| 3606 | 0.67% | |
| 3059 | 0.57% | |
| 2805 | 0.52% | |
| 2624 | 0.49% | |
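A ranking like the one above can be produced with a simple frequency count. The sketch below is an assumption about how such a tally might be done, not a description of Philip's program; `top_messages` and its `fold_case` option are hypothetical names. Setting `fold_case=True` merges messages that differ only in case, such as the uppercase and lowercase "ALT" variants.

```python
from collections import Counter


def top_messages(messages, n=25, fold_case=False):
    """Return (message, count, share_of_total) for the n most frequent messages."""
    if fold_case:
        # Treat 'ALT' and 'alt' variants of a message as the same entry.
        messages = (m.lower() for m in messages)
    counts = Counter(messages)
    total = sum(counts.values())
    return [(msg, count, count / total) for msg, count in counts.most_common(n)]
```

As a check on the figures above, the 25 counts in the table sum to 299443, and 299443 / 540726 is about 55.4%, which matches the "just over half" figure.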
A page's encoding is sometimes also called its "character set" or "charset" for short.
In the pie chart and table below you can see that the vast majority of encoding declarations are either UTF-8 or ISO-8859-1. About half of the declarations in the sample (52%) are UTF-8 and four in ten (41%) are ISO-8859-1. The third most popular is Windows-1252 (an ISO-8859-1 superset) with a meager 2% of the total.
When no encoding is specified, Nikita usually defaults to ISO-8859-1 per HTTP rules [11].
In the table below, the total number of encodings found is greater than the number of pages in the sample because many pages declare multiple encodings. (Nikita will warn you if she finds multiple encodings declared for a page on your site.)
| Instances | Portion of Total | Encoding |
|---|---|---|
| 7446 | 52.43% | UTF-8 * |
| 5814 | 40.94% | ISO-8859-1 |
| 309 | 2.18% | WINDOWS-1252 |
| 180 | 1.27% | ISO-8859-15 |
| 173 | 1.22% | WINDOWS-1251 |
| 91 | 0.64% | WINDOWS-1250 |
| 58 | 0.41% | WINDOWS-1257 |
| 35 | 0.25% | ISO-8859-9 |
| 30 | 0.21% | ISO-8859-2 |
| 29 | 0.20% | EUC-JP |
| 23 | 0.16% | BIG5 |
| 7 | 0.05% | ISO-8859-7 |
| 3 | 0.02% | US-ASCII |
| 3 | 0.02% | UTF-16LE |
| 1 | 0.01% | WINDOWS-1502 |
| 1 | 0.01% | WINDOWS-1253 |
* Philip counted the encodings "UTF-8" and "UTF8" as the same. For the record, the former is the overwhelmingly more popular expression with 7,423 occurrences compared to 23 for the latter.
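Counting "UTF-8" and "UTF8" as the same encoding amounts to mapping each declared label to one canonical spelling before tallying. One way to do that, shown here purely as an illustration, is to lean on Python's codec registry; this is a stand-in and not necessarily how Philip merged the two. The helper names are hypothetical, and labels Python doesn't recognize are kept as declared (uppercased).

```python
import codecs
from collections import Counter


def canonical_name(label):
    """Map a declared encoding label to a single canonical spelling."""
    label = label.strip()
    try:
        # Python's registry resolves aliases, e.g. "UTF8" and "UTF-8".
        return codecs.lookup(label).name
    except LookupError:
        # Unknown to Python: keep the label as declared, uppercased.
        return label.upper()


def tally_encodings(labels):
    """Count declarations per canonical encoding name."""
    return Counter(canonical_name(label) for label in labels)
```

Note that the canonical names are Python's own spellings, so they won't always match the labels used in the table.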
Web pages can declare their encoding in four different places: in the HTTP Content-Type header, inside the HTML in the META HTTP-EQUIV Content-Type tag, in the file's BOM, or in the XML declaration. (You might want to read about how Nikita divines a Web page's encoding [12] from this jumble of information.)
In the pie chart and table below you can see that almost ⅔ of the encodings are specified via a META tag and most of the rest are declared via HTTP.
In the table below, the number of encoding sources is greater than the number of pages in the sample because many pages declare their encoding in multiple places. (Nikita will warn you if she finds multiple encodings declared for a page on your site.)
| Instances | Portion of Total | Source |
|---|---|---|
| 9021 | 63.61% | META HTTP-equiv tag |
| 4299 | 30.32% | HTTP response header |
| 737 | 5.20% | Fallback to default |
| 123 | 0.87% | The file's BOM (byte order mark) |
| 1 | 0.01% | The XML declaration |
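To make the four declaration sources concrete, here is a rough sketch that collects whatever encodings a page declares in each place. It is a simplification and not Nikita's actual logic: the function name and regular expressions are assumptions, only three BOMs are checked, the markup is assumed to be ASCII-compatible, and no precedence among the sources is applied.

```python
import codecs
import re

BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]
CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)
META_RE = re.compile(
    rb"<meta[^>]+http-equiv\s*=\s*[\"']?content-type[^>]*>", re.IGNORECASE)
XML_DECL_RE = re.compile(
    rb"^<\?xml[^>]*encoding\s*=\s*[\"']([\w.:-]+)[\"']", re.IGNORECASE)


def declared_encodings(content_type_header, body):
    """Return a dict mapping each declaration source to the encoding found.

    `content_type_header` is the HTTP Content-Type value (str or None);
    `body` is the raw page content as bytes.
    """
    found = {}
    match = CHARSET_RE.search(content_type_header or "")
    if match:
        found["http"] = match.group(1).upper()
    for bom, name in BOMS:
        if body.startswith(bom):
            found["bom"] = name
            body = body[len(bom):]   # look past the BOM for the markup
            break
    match = XML_DECL_RE.search(body)
    if match:
        found["xml"] = match.group(1).decode("ascii").upper()
    meta = META_RE.search(body)
    if meta:
        match = CHARSET_RE.search(meta.group(0).decode("ascii", "replace"))
        if match:
            found["meta"] = match.group(1).upper()
    if not found:
        # Nothing declared anywhere: fall back to the HTTP default.
        found["default"] = "ISO-8859-1"
    return found
```

A page whose result contains more than one distinct value has conflicting declarations, which is the kind of situation that would merit a warning.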
There are 41 distinct doctypes in the sample. For purposes of discussing them here, Philip considered doctypes the same if they used the same FPI (Formal Public Identifier). For example, these two doctypes were considered equivalent:
Philip chose not to display doctypes that represented < 1% of the sample.
In the pie chart and table below, you can see that XHTML doctypes represent a little more than ⅔ of the sample, with HTML (of course) making up the remainder. Interestingly, transitional doctypes (both HTML and XHTML) dominate the field.
In the table below, the number of doctypes sums to less than the number of pages because Philip ignored doctypes that represented < 1% of the total.
| Instances | Portion of Total | Doctype |
|---|---|---|
| 4123 | 43.93% | XHTML 1.0 Transitional |
| 1994 | 21.24% | XHTML 1.0 Strict |
| 1655 | 17.63% | HTML 4.01 Transitional |
| 533 | 5.68% | HTML 4.01 Strict |
| 249 | 2.65% | HTML 4.0 Transitional |
| 211 | 2.25% | XHTML 1.1 |
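Grouping doctypes by FPI and dropping the rare ones might look something like the sketch below. The regular expression and function names are assumptions for illustration, and the example doctypes in the usage note are hypothetical inputs rather than ones taken from the sample.

```python
import re
from collections import Counter

# Captures the quoted FPI that follows the PUBLIC keyword.
FPI_RE = re.compile(r'<!DOCTYPE\s+\w+\s+PUBLIC\s+"([^"]+)"', re.IGNORECASE)


def doctype_key(doctype):
    """Return the FPI if the doctype has one, else the doctype itself."""
    match = FPI_RE.search(doctype)
    return match.group(1) if match else doctype.strip()


def tally_doctypes(doctypes, min_share=0.01):
    """Count doctypes by FPI, keeping only those at or above min_share."""
    counts = Counter(doctype_key(d) for d in doctypes)
    total = sum(counts.values())
    return {fpi: n for fpi, n in counts.most_common() if n / total >= min_share}
```

Under this scheme, `<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">` and the same declaration with a relative system identifier such as `"DTD/xhtml1-strict.dtd"` fall into the same bucket, because only the FPI is compared.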
A media type is also sometimes called a "content type". Nearly all of the pages in this sample (> 99.6%) use the text/html media type except for the one-third of one percent that use application/xhtml+xml.
Nikita's view of the world might be slightly skewed away from application/xhtml+xml for a couple of reasons. First of all, Nikita sends an Accept header of */* and many servers capable of sending application/xhtml+xml might do so only if they find that exact string in the Accept header sent by the client.
Second, Nikita doesn't attempt to masquerade as a browser via her user agent string; it is simply "Nikita the Spider" and doesn't contain words like "Mozilla", "Opera", "WebKit", "KHTML", etc. that might convince servers that she's capable of handling XHTML. It's likely that cautious servers choose to send text/html to Nikita.
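The two kinds of caution described above could be implemented by a server along these lines. This is a hypothetical sketch of server-side behavior, not code from any particular server; the function names and token list are made up for the example, and quality values (`q=`) in the Accept header are ignored for simplicity.

```python
BROWSER_TOKENS = ("mozilla", "opera", "webkit", "khtml")


def choose_media_type(accept_header):
    """Send XHTML only when the client names it explicitly in Accept."""
    offered = [part.split(";")[0].strip().lower()
               for part in accept_header.split(",")]
    if "application/xhtml+xml" in offered:
        return "application/xhtml+xml"
    return "text/html"   # a bare */* is not treated as a request for XHTML


def looks_like_browser(user_agent):
    """Crude user agent sniffing of the kind a cautious server might do."""
    ua = user_agent.lower()
    return any(token in ua for token in BROWSER_TOKENS)
```

A server behaving this way would answer Nikita's `*/*` with text/html, and `looks_like_browser("Nikita the Spider")` returns `False`.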
The overwhelming dominance of text/html makes the pie chart and table below superfluous but they're here for completeness.
Unlike some of the other properties discussed in this article, there's a simple 1:1 relationship between pages and media types. Each page has exactly one, which is specified in the HTTP Content-Type header. (It's possible to specify something else in an HTML META HTTP-EQUIV tag, but this is non-standard and of no practical use. Nikita doesn't look for media types specified in the page contents.)
| Instances | Portion of Total | Media Type |
|---|---|---|
| 10765 | 99.68% | text/html |
| 35 | 0.32% | application/xhtml+xml |
The statistics above give an overview of what Nikita has seen on the sites she's crawled recently.
The most frequent validation messages are probably not a great surprise to anyone who has hand-coded HTML. It's unfortunately easy to forget the alt attribute on img tags, and the second-most common error ("reference not terminated by REFC delimiter") is a sure sign of unescaped ampersands (&, ASCII 38). Validators are excellent tools for catching these sorts of mistakes.
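For the unescaped-ampersand problem in particular, a rough check is easy to write. The sketch below flags any `&` that doesn't begin a named or numeric character reference; it is a heuristic, not a validator, and it makes no attempt to handle script blocks, comments, or CDATA sections. The names used are hypothetical.

```python
import re

# An ampersand NOT followed by "name;", "#123;", or "#x1F;".
BARE_AMPERSAND = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)")


def find_bare_ampersands(markup):
    """Return the offsets of ampersands that look unescaped."""
    return [m.start() for m in BARE_AMPERSAND.finditer(markup)]


def escape_bare_ampersands(markup):
    """Replace unescaped ampersands with &amp;, leaving references alone."""
    return BARE_AMPERSAND.sub("&amp;", markup)
```

The most common culprit is a query string inside an attribute, such as `href="page?x=1&y=2"`, which should be written `href="page?x=1&amp;y=2"`.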
The frequent use of META tags to specify encodings speaks to a common inability to set encodings via HTTP headers, or an unawareness of that ability.
The most surprising observation to Philip is the widespread use of the transitional doctypes. At the time these data were collected in 2008, the transitional doctypes were already more than eight years old, which is about half the lifetime of the Web itself. That's quite the lengthy transition...
Philip believes his sample is a sufficiently random representation of what Nikita sees, but what Nikita sees doesn't represent the Web as a whole. For one thing, Nikita has mostly been promoted in English-speaking venues. Also, it's reasonable to assume that those who ask Nikita to crawl their site are more aware of Web standards than the average Web site author. To quote How To Lie With Statistics [13], "The result of a sampling study is no better than the sample it is based on."
Originally posted by Philip Semanchuk at Nikita The Spider [14]
This article is copyrighted to Philip Semanchuk under an attribution, non-commercial, share-alike Creative Commons License [15].
Links:
[1] http://lists.w3.org/Archives/Public/www-validator/2008Feb/0015.html
[2] http://nikitathespider.com/reports/sample/
[3] http://nikitathespider.com/start.spy
[4] http://nikitathespider.com/articles/ByTheNumbers/#methods
[5] http://nikitathespider.com/articles/ByTheNumbers/#validation
[6] http://nikitathespider.com/articles/ByTheNumbers/#encodings
[7] http://nikitathespider.com/articles/ByTheNumbers/#EncodingSources
[8] http://nikitathespider.com/articles/ByTheNumbers/#doctypes
[9] http://nikitathespider.com/articles/ByTheNumbers/#MediaTypes
[10] http://nikitathespider.com/articles/ByTheNumbers/#conclusion
[11] http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7.1
[12] http://nikitathespider.com/articles/EncodingDivination.html
[13] http://en.wikipedia.org/wiki/How_to_Lie_with_Statistics
[14] http://nikitathespider.com/articles/ByTheNumbers/
[15] http://creativecommons.org/licenses/by-nc-sa/3.0/