
Generate an HTML 5 DOCTYPE using XSLT

Tested with xsltproc on

Debian (Etch, Lenny, Squeeze)
Ubuntu (Hardy, Intrepid, Jaunty, Karmic, Lucid, Maverick, Natty, Oneiric, Precise, Quantal)

Tested with Xalan on

Debian (Etch, Lenny, Squeeze)
Ubuntu (Hardy, Intrepid, Jaunty, Karmic, Lucid, Maverick, Natty, Oneiric, Precise, Quantal)

Tested with Saxon-B on

Debian (Lenny, Squeeze)
Ubuntu (Intrepid, Jaunty, Karmic, Lucid, Maverick, Natty, Oneiric, Precise, Quantal)

Tested with Saxon-6 on

Debian (Lenny, Squeeze)
Ubuntu (Hardy, Intrepid, Jaunty, Karmic, Lucid, Maverick, Natty, Oneiric, Precise, Quantal)

Objective

To include an appropriate DOCTYPE when generating an HTML 5 document using an XSLT stylesheet

Background

Conforming HTML documents must begin with a document type declaration (DOCTYPE). Prior to HTML 5 this included public and system identifiers, for example:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

HTML 5 documents must still have a DOCTYPE, but the preferred form contains only the name of the root element:

<!DOCTYPE html>

(At the time of writing, the HTML 5 specification had not been finalised; however, the above is unlikely to change. It was correct as of November 2012. The declaration is case-insensitive.)

XSLT provides the xsl:output element for specifying what should appear within the DOCTYPE, provided that it has a public identifier and/or a system identifier:

<xsl:output method="html"
 doctype-public="-//W3C//DTD HTML 4.01//EN"
 doctype-system="http://www.w3.org/TR/html4/strict.dtd"/>

However, if both public and system identifiers are omitted then so too is the entire DOCTYPE.

Scenario

Suppose that you wish to generate a conforming HTML 5 document using an XSLT 1.0 or XSLT 2.0 stylesheet.

The output document is intended for publication but the stylesheet will remain private. For this reason, if any compromises are necessary you would prefer that they affect the stylesheet rather than the output document.

(A method which optimises for robustness of the stylesheet is presented later as an alternative.)

Method

A crude but effective way to produce the document type declaration is to treat it as a string of raw text. It can then be generated using an xsl:text element:

<xsl:text disable-output-escaping='yes'>&lt;!DOCTYPE html&gt;</xsl:text>

This should be placed at the point in the stylesheet immediately prior to where the root element of the output document is generated, for example:

<xsl:template match="/document">
 <xsl:text disable-output-escaping='yes'>&lt;!DOCTYPE html&gt;</xsl:text>
 <html>
  <head>
   <title>The Theory and Practice of Oligarchical Collectivism</title>
  </head>
  <body>
   <xsl:apply-templates/>
  </body>
 </html>
</xsl:template>

It is necessary to disable output escaping (using the disable-output-escaping attribute) because otherwise the less-than character at the start of <!DOCTYPE would be written to the output as &lt;.

There should be an xsl:output element in the stylesheet setting the output method to html, but without specifying a public or system identifier:

<xsl:output method="html"/>
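Putting the pieces above together, a complete stylesheet might look like the following minimal sketch. It assumes, as in the template above, that the root element of the input document is named document; adjust the match pattern to suit your own input.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

 <!-- HTML output method, with no public or system identifier,
      so the processor generates no DOCTYPE of its own. -->
 <xsl:output method="html"/>

 <!-- Assumption: the input document has a root element named "document". -->
 <xsl:template match="/document">
  <!-- Write the DOCTYPE as raw text, immediately before the root element. -->
  <xsl:text disable-output-escaping='yes'>&lt;!DOCTYPE html&gt;</xsl:text>
  <html>
   <head>
    <title>The Theory and Practice of Oligarchical Collectivism</title>
   </head>
   <body>
    <xsl:apply-templates/>
   </body>
  </html>
 </xsl:template>

</xsl:stylesheet>
```

With xsltproc this could be run as xsltproc stylesheet.xsl input.xml, where both file names are placeholders. Provided that the processor honours disable-output-escaping, the output should begin with <!DOCTYPE html>.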

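This method has two significant drawbacks:

Support for the disable-output-escaping attribute is optional, so an XSLT processor is not required to honour it.
Even where it is supported, it can only take effect when the processor itself serialises the result tree. It has no effect if the result is passed on as a tree, for example as the input to another transformation.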

For these reasons this cannot be considered a fully robust method unless the manner in which the document will be processed is known in advance. Its advantage is that the resulting document type declaration is an entirely normal one, requiring no compromise to the content of the output document.

Alternatives

Using a DOCTYPE legacy string

The first two working drafts of the HTML 5 specification permitted no variation in the document type declaration beyond the case of the letters. However, in recognition of the difficulties this could cause, the working draft of 12th February 2009 introduced support for the ‘DOCTYPE legacy string’. This has the form:

<!DOCTYPE html SYSTEM "about:legacy-compat">

Because this has a system identifier it can be generated using an xsl:output element:

<xsl:output method="html" doctype-system="about:legacy-compat"/>

From an XSLT perspective this is a significantly more robust method for identifying the document type as HTML 5. The main drawback is that it shows through into the output document (albeit in a manner that should be invisible to most users), which may or may not be acceptable to you.
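For comparison, a minimal sketch of a complete stylesheet using the legacy string is given below. As before, it assumes an input document whose root element is named document. No xsl:text element is needed, because the processor generates the declaration itself from the doctype-system attribute.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

 <!-- The system identifier causes the processor to emit
      <!DOCTYPE html SYSTEM "about:legacy-compat"> itself. -->
 <xsl:output method="html" doctype-system="about:legacy-compat"/>

 <!-- Assumption: the input document has a root element named "document". -->
 <xsl:template match="/document">
  <html>
   <head>
    <title>The Theory and Practice of Oligarchical Collectivism</title>
   </head>
   <body>
    <xsl:apply-templates/>
   </body>
  </html>
 </xsl:template>

</xsl:stylesheet>
```

Because this relies only on xsl:output, it does not depend on optional processor features or on how the result tree is serialised.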
