Doctype Classes
|
|
Doctype (Generic Document Type)
|
|
SGMLNORM
^
|
|
HTML
Although the HTML doctype is a subclass of SGMLNORM, it does not require normalized HTML. It handles almost all HTML markup (tag-normalized or not) and entities, e.g. &Uuml; (Ü), &amp; (&), &plusmn; (±) etc. Engine versions prior to IB 2.2, as well as public Isearch, do not support entity expansion because of limitations in the indexer model; for proper operation they require the entities in the input documents to be normalized.
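For engines without entity expansion, the input has to be pre-processed before indexing. The following is a minimal sketch of such a step, assuming that entity normalization here means replacing character references with the characters they denote. It uses Python's standard html module as a stand-in and is not part of IB or Isearch; the function name normalize_entities is hypothetical.

```python
import html


def normalize_entities(text: str) -> str:
    """Replace named and numeric character references with the
    characters they denote (hypothetical pre-indexing step)."""
    return html.unescape(text)


if __name__ == "__main__":
    print(normalize_entities("&Uuml; &amp; &plusmn;"))
```

Text that contains no character references passes through unchanged, so the step can be applied to every document regardless of how it was authored.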
The doctype supports HTML 0.9, 2.0, 3.0, (version > 1.9) HTML 3.2, W3O's Cougar (version > 1.18) as well as many vendor extensions, kludges and abuses. Most common incorrect uses of HTML markup are also handled correctly. It has been tested against a very large random sample of several gigabytes of HTML data from thousands of sites and has been used in the field to index many terabytes of HTML pages at hundreds of sites.
Most HTML authors, and nearly all browser vendors, have no background in SGML, and few have bothered to acquire one. The emerging trend on the WWW is to design pages using descriptive, in contrast to content, markup and, worse still, to exploit the specific quirks of a single browser with the aim of emulating a simple desktop publishing layout. While XML and DSSSL might be set to change this, especially given an interest in push delivery models, HTML will continue to become increasingly browser-centric and proprietary.
By default the doctype indexes all fields that represent content (container tags) in an HTML document. Descriptive markup, such as <I>, <B>, <TT> etc., is ignored.
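The distinction between content containers and descriptive markup can be illustrated with a small sketch. It is not the doctype's implementation: it uses Python's standard html.parser, the set of descriptive tags is an illustrative subset, and the names FieldCollector and collect_fields are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: an illustrative subset of descriptive (presentational) tags.
DESCRIPTIVE = {"i", "b", "tt"}
# Empty elements have no content and no end tag, so they open no field.
EMPTY = {"meta", "link", "base", "br", "hr", "img", "isindex"}


class FieldCollector(HTMLParser):
    """Collect text under every open container tag, skipping descriptive tags."""

    def __init__(self):
        super().__init__()
        self.stack = []    # container tags currently open
        self.fields = {}   # field name -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag not in DESCRIPTIVE and tag not in EMPTY:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching tag; this tolerates
            # containers that were left unclosed inside it.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            for tag in self.stack:
                self.fields.setdefault(tag.upper(), []).append(text)


def collect_fields(markup):
    parser = FieldCollector()
    parser.feed(markup)
    parser.close()
    return {name: " ".join(parts) for name, parts in parser.fields.items()}
```

Text inside <I>, <B> or <TT> still contributes to the enclosing container's field; only the descriptive tag itself produces no field of its own.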
In HTML 2.0 the METAs may occur anywhere in the HEAD:
<!ENTITY % head.content "TITLE & ISINDEX? & BASE? %head.extra">
<!ELEMENT HEAD O O (%head.content) +(META|LINK)>
But in HTML 3.2 all the METAs must be grouped together:
<!ENTITY % head.content "TITLE & ISINDEX? & BASE? & STYLE? &
SCRIPT* & META* & LINK*">
<!ELEMENT HEAD O O (%head.content)>
The current standard specifies META as a complex empty tag via a name/value pair: NAME and CONTENT.
<!ELEMENT META - O EMPTY>
<!ATTLIST META
HTTP-EQUIV NAME #IMPLIED
NAME NAME #IMPLIED
CONTENT CDATA #REQUIRED >
The HTML document type definition offers very little content structure, which makes the development of structured databases difficult. The META tag mechanism offers some relief.
Meta tags such as <META NAME="AUTHOR" CONTENT="Elmer Fudd"> are searchable (in versions > 1.9) as META.AUTHOR:
<META NAME="AUTHOR" CONTENT="Elmer Fudd"> -->
META@ := NAME="AUTHOR" CONTENT="Elmer Fudd"
META@NAME := "AUTHOR"
META@CONTENT := "Elmer Fudd"
In versions > 1.9:
META.AUTHOR := "Elmer Fudd"
In versions > 1.10, a tag such as <META HTTP-EQUIV="AUTHOR" CONTENT="Elmer Fudd"> is searchable under the field named by HTTP-EQUIV, here AUTHOR:
<META HTTP-EQUIV="AUTHOR" CONTENT="Elmer Fudd"> -->
META@ := HTTP-EQUIV="AUTHOR" CONTENT="Elmer Fudd"
META@NAME := "CONTENT"
META@CONTENT := "Elmer Fudd"
AUTHOR := "Elmer Fudd"
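The two mappings above can be sketched as follows. This is an illustration rather than the doctype's code: it uses Python's standard html.parser, the names MetaFields and meta_fields are hypothetical, and folding field names to upper case is an assumption.

```python
from html.parser import HTMLParser


class MetaFields(HTMLParser):
    """Derive searchable field names from META tags."""

    def __init__(self):
        super().__init__()
        self.fields = []   # (field name, value) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)              # attribute names arrive lowercased
        content = attr.get("content")
        if content is None:
            return                      # CONTENT is #REQUIRED; skip malformed tags
        if attr.get("name"):
            # NAME=x is searchable as META.x
            self.fields.append(("META." + attr["name"].upper(), content))
        elif attr.get("http-equiv"):
            # HTTP-EQUIV=x is searchable directly as x
            self.fields.append((attr["http-equiv"].upper(), content))


def meta_fields(markup):
    parser = MetaFields()
    parser.feed(markup)
    parser.close()
    return parser.fields
```

Under these assumptions, a NAME-based tag yields a field with the META. prefix, while an HTTP-EQUIV-based tag yields a bare field name.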
Versions based upon SGMLNORM > 1.8 support Schema and Type specified in the content, viz.
<META NAME="LANGUAGE" CONTENT="(Schema=ISO6639)en">
is searchable under META.LANGUAGE(ISO6639). The keywords Schema, Scheme and Type are currently supported.
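A qualifier embedded in CONTENT can be separated from the value with a short pattern match. The sketch below assumes the qualifier always appears at the very start of the content and that keyword matching is case-insensitive; neither detail is stated in this document, and the function name qualified_field is hypothetical.

```python
import re

# Matches a leading "(Schema=...)", "(Scheme=...)" or "(Type=...)" qualifier.
_QUALIFIER = re.compile(
    r"^\((Schema|Scheme|Type)=([^)]*)\)(.*)$",
    re.IGNORECASE | re.DOTALL,
)


def qualified_field(name, content):
    """Return (field name, value) for a META NAME/CONTENT pair."""
    match = _QUALIFIER.match(content)
    if match is None:
        return "META." + name.upper(), content
    _keyword, qualifier, value = match.groups()
    return "META.%s(%s)" % (name.upper(), qualifier), value
```

Content without a qualifier falls back to the plain META.name field.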
The W3O Experimental Cougar DTD extends META to:
<!ATTLIST META
%i18n; -- lang, dir for use with content string --
http-equiv NAME #IMPLIED -- HTTP response header name --
name NAME #IMPLIED -- metainformation name --
content CDATA #REQUIRED -- associated information --
scheme CDATA #IMPLIED -- select form of content --
>
A Scheme= attribute overrides a Scheme specified in the content, e.g. CONTENT="(Scheme=..)...".
This allows the development of interfaces to search for HTML pages based on meta-data, viz. content that HTML has no container for.
In versions > 1.10: should the path specified in <BASE HREF=path> have a trailing /, the filename is concatenated to it. This way one may specify the same BASE for all files in the same directory.
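The trailing-slash rule can be sketched as below. The function name document_url is hypothetical, and the choice to append only the last path component of the local file name is an assumption; the text says only that the filename is concatenated.

```python
import posixpath


def document_url(base_href, filename):
    """Derive a document URL from a BASE HREF value and a local file name."""
    if base_href.endswith("/"):
        # BASE names a directory: append the file's own name.
        return base_href + posixpath.basename(filename)
    # BASE already names the document itself.
    return base_href
```

With this rule, every file in a directory can carry the same BASE value and still resolve to a distinct URL.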
The URL is searchable via the BASE field. Since binary data such as images, audio and video files often have descriptive names, e.g. ElmerFudd.au, this can also provide a pragmatic basis for searching multimedia information.
In versions > 1.10 one can also specify the name and port of the WWW server. This is designed to support multiple virtual hosts or mapping to a PURL (Persistent URL) resolver.
Support for i18n from the W3C draft Cougar DTD:
<!ENTITY % i18n
"lang NAME #IMPLIED -- RFC 1766 language value --
dir (ltr|rtl) #IMPLIED -- default directionality --"
>
is not yet available. It is planned for the next version.
The HTML doctype is NOT a validator and makes many assumptions. It tries to guess the intent of some of the HTML kludges in common use. It is, nonetheless, advisable to validate HTML pages and to adhere to a content model.
Version 3.x will include support for SUTRS Presentation.
See: Hypertext Markup Language draft RFCs (e.g. draft-ietf-html-spec-04.txt).