HTML Doctype Description:

The HTML Doctype is for documents marked up using the Hypertext Markup Language of the World Wide Web (WWW).
  Doctype Classes
        |
        |
     Doctype (Generic Document Type)
        |
        |
     SGMLNORM
        ^
        |
        |
      HTML

Although the HTML doctype is a subclass of SGMLNORM, it does not require normalized HTML. It handles almost all HTML markup (tag normalized or not) and entities, e.g. &Uuml; (Ü), &amp; (&), &#177; (±) etc. Engine versions prior to IB 2.2, as well as public Isearch, do not support entity expansion due to limitations in the indexer model; for proper operation they require entity normalization of the input documents.
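
For engines without entity expansion, the text can be normalized before it is handed to the indexer. The following is a minimal sketch in Python, not part of the engine; the function name normalize_entities is hypothetical and the standard library's entity table is assumed to be acceptable for the documents at hand.

```python
import html

def normalize_entities(text: str) -> str:
    """Replace character entity references with the characters they name."""
    # html.unescape handles named (&Uuml;) and numeric (&#177;) references.
    return html.unescape(text)

if __name__ == "__main__":
    print(normalize_entities("M&uuml;nchen &amp; 5 &#177; 1"))
```

Note that this also expands &lt; and &gt;, so it is better applied to text content than to a whole document whose markup must stay intact.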

The doctype supports HTML 0.9, 2.0, 3.0, (version > 1.9) HTML 3.2, W3O's Cougar (version > 1.18) as well as many vendor extensions, kludges and abuses. Most common incorrect uses of HTML markup are also correctly handled. It has been tested against a very large random sample of several gigabytes of HTML data from thousands of sites and has been used in the field to index many terabytes of HTML pages at hundreds of sites.

Most HTML authors, and nearly all browser vendors, have no background in SGML and few have bothered to acquire one. The emerging trend in the WWW is to design pages using descriptive markup rather than content markup and, worse still, to exploit the specific quirks of a single browser with the aim of emulating a simple desktop publishing layout. While XML and DSSSL might be set to change this, especially given an interest in push delivery models, HTML will continue to become increasingly browser-centric and proprietary.

By default the doctype indexes all fields that represent content (container tags) in an HTML document. Descriptive markup, such as <I>, <B>, <TT> etc., is ignored.

Parser Levels

The parser (BSn only) knows several levels:
Basic (5)
    Accept only a few tags and combine them into some simple groups with plain English names. This level is for robots and web services.
Valid (4)
    Accept only valid HTML, make fewer assumptions and complain.
Strict (3)
    Accept only known HTML tags, complain about bogus use.
Normal (2)
    Accept only known HTML tags.
Maximal (1)
    Accept all tags except some descriptive markup.
Full (0)
    Accept anything.
Antihtml (-1)
    Use HTML-- for processing.

The level is specified as a doctype option (e.g. -o LEVEL=Level) or via LEVEL in the HTML section of the .ini file. Either the long name or the number can be specified.
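
Because a level can be given either by name or by number, a front end that passes the option through may want to normalize it first. A small sketch in Python, using the names and numbers listed above; the function parse_level and the case-insensitive matching are assumptions for illustration, not documented engine behaviour.

```python
# Level names and numbers as listed in this section.
LEVELS = {
    "BASIC": 5, "VALID": 4, "STRICT": 3, "NORMAL": 2,
    "MAXIMAL": 1, "FULL": 0, "ANTIHTML": -1,
}

def parse_level(value: str) -> int:
    """Accept either the long name or the number of a parser level."""
    text = value.strip()
    try:
        number = int(text)
    except ValueError:
        # Assumption: names are matched case-insensitively.
        name = text.upper()
        if name not in LEVELS:
            raise ValueError(f"unknown parser level: {value!r}")
        return LEVELS[name]
    if number not in LEVELS.values():
        raise ValueError(f"unknown parser level: {value!r}")
    return number
```

With this helper, parse_level("Strict") and parse_level("3") both yield 3.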

META

The META tag is used to embed metadata in an HTML document. It belongs in the HEAD of the document (since much HTML is non-conformant, the HTML parser will accept it anywhere).

In HTML 2.0 the METAs may occur anywhere in the HEAD:

<!ENTITY % head.content "TITLE & ISINDEX? & BASE? %head.extra">

<!ELEMENT HEAD O O  (%head.content) +(META|LINK)>

But in HTML 3.2 all the METAs must be grouped together.

<!ENTITY % head.content "TITLE & ISINDEX? & BASE? & STYLE? &
                            SCRIPT* & META* & LINK*">

<!ELEMENT HEAD O O  (%head.content)>

The current standard specifies META as a complex empty tag via a name/value pair: NAME and CONTENT.

<!ELEMENT META - O EMPTY>
<!ATTLIST META
        HTTP-EQUIV  NAME    #IMPLIED
        NAME        NAME    #IMPLIED
        CONTENT     CDATA   #REQUIRED    >

The HTML document type definition offers very little content structure making the development of structured databases difficult. The META-tag mechanism offers some relief.

Meta tags such as <META NAME="AUTHOR" CONTENT="Elmer Fudd"> are searchable (in versions > 1.9) as META.AUTHOR:

  <META NAME="AUTHOR" CONTENT="Elmer Fudd"> -->
     META@        := NAME="AUTHOR" CONTENT="Elmer Fudd"
     META@NAME    := "AUTHOR"
     META@CONTENT := "Elmer Fudd"
  Version > 1.9
     META.AUTHOR  := "Elmer Fudd"
In versions > 1.10, <META HTTP-EQUIV="AUTHOR" CONTENT="Elmer Fudd"> is handled as well and is searchable as AUTHOR:
  <META HTTP-EQUIV="AUTHOR" CONTENT="Elmer Fudd"> -->
     META@        := HTTP-EQUIV="AUTHOR" CONTENT="Elmer Fudd"
     META@NAME    := "CONTENT"
     META@CONTENT := "Elmer Fudd"
     AUTHOR       := "Elmer Fudd"
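
The mapping from META tags to searchable field names can be illustrated outside the engine. The sketch below uses Python's standard html.parser and follows the two mappings shown above (META.AUTHOR for NAME, AUTHOR for HTTP-EQUIV). The class and function names are hypothetical, and upper-casing the field name is an assumption.

```python
from html.parser import HTMLParser

class MetaFields(HTMLParser):
    """Collect (field, value) pairs from META tags."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)  # tag and attribute names arrive lower-cased
        content = a.get("content")
        if content is None:
            return
        if a.get("name"):
            # NAME=... is searchable as META.<NAME>
            self.fields.append(("META." + a["name"].upper(), content))
        elif a.get("http-equiv"):
            # HTTP-EQUIV=... is searchable under its own name
            self.fields.append((a["http-equiv"].upper(), content))

def meta_fields(document: str):
    parser = MetaFields()
    parser.feed(document)
    return parser.fields
```

Like the doctype's parser, this accepts META anywhere in the document rather than only in the HEAD.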

Versions based upon SGMLNORM > 1.8 support a Schema and Type specified in the content, e.g. <META NAME="LANGUAGE" CONTENT="(Schema=ISO6639)en"> is searchable under META.LANGUAGE(ISO6639).
The keywords Schema, Scheme and Type are currently supported.

The W3C Experimental Cougar DTD extends META to:

<!ATTLIST META
  %i18n;                           -- lang, dir for use with content string --
  http-equiv  NAME       #IMPLIED  -- HTTP response header name  --
  name        NAME       #IMPLIED  -- metainformation name --
  content     CDATA      #REQUIRED -- associated information --
  scheme      CDATA      #IMPLIED  -- select form of content -- >

A SCHEME attribute overrides any scheme specified inside CONTENT, e.g. CONTENT="(Scheme=..)...".
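
The scheme handling can be sketched as follows. This is an illustration in Python under stated assumptions, not the engine's code: the prefix is assumed to have the form "(Keyword=Value)" at the very start of CONTENT, keywords are matched case-insensitively, and the helper names split_scheme and field_name are hypothetical.

```python
import re

# Assumption: one "(Schema=...)", "(Scheme=...)" or "(Type=...)" prefix
# at the start of the CONTENT string.
_PREFIX = re.compile(r"^\((schema|scheme|type)=([^)]*)\)", re.IGNORECASE)

def split_scheme(content, scheme_attr=None):
    """Return (scheme, value) for a META CONTENT string."""
    scheme, value = None, content
    match = _PREFIX.match(content)
    if match:
        scheme = match.group(2)
        value = content[match.end():]
    if scheme_attr is not None:
        # An explicit SCHEME attribute wins over the prefix in CONTENT.
        scheme = scheme_attr
    return scheme, value

def field_name(name, scheme):
    """Build the searchable field name, e.g. META.LANGUAGE(ISO6639)."""
    base = "META." + name.upper()
    return f"{base}({scheme})" if scheme else base
```

For the LANGUAGE example above, split_scheme("(Schema=ISO6639)en") gives the scheme ISO6639 and the value en, and field_name("LANGUAGE", "ISO6639") gives META.LANGUAGE(ISO6639).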

This allows the development of interfaces to search for HTML pages based on meta-data, viz. content that HTML has no container for.

BASE

If a document does not define <BASE HREF="..." ... >, the doctype option or environment variable WWW_ROOT is used to synthesize its value. This enables CGI interfaces (or clients) to "derive" a fully-qualified URL to the original document and to resolve relative hypertext references.

In Version >1.10:

  1. The .dbi/.ini file is examined for a Pages entry in the HTTP section.
  2. Otherwise, if the doctype option WWW_ROOT is defined, it is used.
  3. Otherwise, if the doctype option HTTP_PATH is defined, it is used.
  4. Otherwise, if the doctype option HTDOCS is defined, it is used.
  5. Otherwise, the environment is checked for the above variables, processed in that order.
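
The lookup order above amounts to a first-match search. A minimal sketch in Python, assuming the Pages entry and the doctype options have already been read into plain values; the function resolve_root and its parameters are hypothetical and only mirror the order described here.

```python
import os

# Names checked as doctype options first, then as environment variables.
OPTION_NAMES = ("WWW_ROOT", "HTTP_PATH", "HTDOCS")

def resolve_root(ini_pages=None, options=None, environ=None):
    """Return the first defined root in the documented order, or None."""
    options = options or {}
    environ = os.environ if environ is None else environ
    if ini_pages:                      # Pages entry in the HTTP section
        return ini_pages
    for name in OPTION_NAMES:          # doctype options
        if options.get(name):
            return options[name]
    for name in OPTION_NAMES:          # environment, same order
        if environ.get(name):
            return environ[name]
    return None
```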

Should the path specified in <BASE HREF=path> have a trailing /, the filename is appended to it. This way one may specify the same BASE for all files in the same directory.
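
As an illustration of the trailing-slash rule, a short Python sketch. The function name document_url is hypothetical, and returning a BASE without a trailing slash unchanged is an assumption, since only the trailing-slash case is described here.

```python
def document_url(base_href: str, filename: str) -> str:
    """Append the filename when BASE names a directory (trailing slash)."""
    if base_href.endswith("/"):
        return base_href + filename
    # Assumption: a BASE without a trailing slash already names the document.
    return base_href
```

With a shared BASE of http://www.bsn.com/docs/, the files a.html and b.html in that directory resolve to http://www.bsn.com/docs/a.html and http://www.bsn.com/docs/b.html.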

The URL is searchable via the BASE field. Since binary data such as images, audio and video files often have descriptive names, eg. ElmerFudd.au, this can also provide a pragmatic basis for search of multimedia information.

In versions > 1.10 one can also specify the name and port of the WWW server. This is designed to support multiple virtual hosts or mapping to a PURL (Persistent URL) resolver.

  1. The .dbi/.ini file is examined for a Server entry in the HTTP section. A URL of type http://www.bsn.com:8080 is expected.
  2. If the .dbi/.ini file does not contain an entry, then an HTTP_Server doctype option is expected.
  3. If this is not defined, then the standard HTTP server environment variables SERVER_NAME and SERVER_PORT are used to build the URL. If these are undefined, then one is probably not running under an HTTPD and a file:/// method based URL is constructed (only HTML).

An interesting option is to set the Server to point to a PURL resolver instead of the Web server. This way the resource URL can persist beyond, or be de-coupled from, the URL structure within a web.
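
The server lookup follows the same first-match pattern. A sketch in Python under stated assumptions: the function server_url is hypothetical, the URL built from SERVER_NAME and SERVER_PORT is assumed to use the http scheme, and the fallback returns only the "file://" prefix to which an absolute path would be appended.

```python
import os

def server_url(ini_server=None, options=None, environ=None):
    """Return the server part of the URL in the documented order."""
    options = options or {}
    environ = os.environ if environ is None else environ
    if ini_server:                         # Server entry in the HTTP section
        return ini_server
    if options.get("HTTP_Server"):         # HTTP_Server doctype option
        return options["HTTP_Server"]
    name = environ.get("SERVER_NAME")
    port = environ.get("SERVER_PORT")
    if name and port:
        # Assumption: plain http with an explicit port.
        return f"http://{name}:{port}"
    # Probably not running under an HTTPD: fall back to a file URL.
    return "file://"
```

Pointing the Server entry at a PURL resolver needs no special handling in such a scheme, since the configured value is used as-is.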

LINK

The complex values of LINK are stored in the same way as those of META. The REV and REL values can be used to model external flows to and from the HTML document instance.

I18N:

Support for i18n from the W3C draft Cougar DTD:

<!ENTITY % i18n
 "lang        NAME       #IMPLIED  -- RFC 1766 language value --
  dir         (ltr|rtl)  #IMPLIED  -- default directionality --" >
is not yet available. It is planned for the next version.

Notes:

The HTML Doctype is NOT a validator and makes many assumptions. It tries to guess the intent of some of the HTML kludges in common use. It is nonetheless advisable to validate HTML pages and to adhere to a content model.

Version 3.x will include support for SUTRS Presentation.

See: Hypertext Markup Language draft RFCs (e.g. draft-ietf-html-spec-04.txt).


© Copyright 1995-1997   Basis Systeme netzwerk, Munich. All Rights Reserved.