HTMLMETA Doctype Description:

  Doctype Classes
        |
        |
     Doctype (Generic Document Type)
        |
        |
      COLONDOC
        ^
        |
        |
     HTMLMETA

Metadata is data about an information resource. It is generally structured into a number of elements. There are many different attribute sets and standards for metadata. The HTMLMETA Doctype has been designed to help create and manage meta-databases derived from documents marked-up with Meta-data using the Hypertext Markup Language (HTML) of the World Wide Web (WWW). The HTMLMETA doctype has been designed to support a flexible model for the encapsulation of generic metadata and is equally well suited to Dublin Core, GILS, EAD and other standards.

The documents processed by the doctype are HTML with meta-information embedded in the <HEAD>...</HEAD> of a document using attribute tags rather than container element content. This form is compatible with the HTML standards. The main differences between the HTMLMETA and HTML doctypes are:

  1. Only the fields in <HEAD> are processed.
  2. The field names are processed differently.
  3. Different Presentation.

The parser does not enforce the use of any particular attribute set. Using the GILS core attribute set, for instance, the HTMLMETA doctype can be used to automatically create GILS-compliant Z39.50 servers to HTML data resources.

The default (RAW) data format produced is an experimental XML format suitable for export, remote meta-indexing, whois++ (distributed), X.500/LDAP or import into another Isite or interBasis database using the XML or GILS document handlers.

Mark-up Conventions

An HTML file containing the fragment:
<META NAME="AUTHOR" CONTENT="Elmer Fudd">
<META NAME="COMPANY" CONTENT="Acme Inc.">
<META NAME="ADDRESS" CONTENT="Looney Place 4, Acme Acres">
<META NAME="VERSION" CONTENT="%Z%%Y%%M% %I% %G% %U% BSN">

contains the fields AUTHOR, COMPANY, ADDRESS and VERSION. Each is fully searchable and can be mapped in the Z39.50 server to attribute OIDs.

The above example would produce the SUTRS Presentation fragment:

AUTHOR:Elmer Fudd
COMPANY:Acme Inc.
ADDRESS:Looney Place 4, Acme Acres
VERSION:@(#)HTMLMETA.i.html 1.11 11/27/98 23:19:35 BSN 1)
LOCATION:"URL to the original HTML" 2)

1) If the document was processed by sccs-get to resolve the %?% symbols.
2) "The URL to the original HTML" is derived from the <BASE ... > or synthetically derived. For a GILS or other meta-search facility it can point to a PURL (Persistent URL) resolver elsewhere, such as gils.org, that, in turn, maps to the resource. The latter is of particular relevance for maintaining links to resources for robots and other page gatherers. See HTML BASE processing.
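
The extraction step described above can be sketched in a few lines of Python. This is only an illustration built on the standard html.parser module, not the doctype's actual parser; the names HeadMetaParser and to_sutrs are made up for the example, and the LOCATION line is left out because it depends on BASE processing.

```python
from html.parser import HTMLParser


class HeadMetaParser(HTMLParser):
    """Collect NAME/CONTENT pairs from META tags found inside <HEAD>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.fields = []  # (name, content) pairs in document order

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names for us.
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            if name and content is not None:
                self.fields.append((name, content))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def to_sutrs(html_text):
    """Render the collected fields as NAME:content lines."""
    parser = HeadMetaParser()
    parser.feed(html_text)
    parser.close()
    return "\n".join("%s:%s" % field for field in parser.fields)
```

META tags outside of <HEAD> are skipped, which mirrors the first difference from the HTML doctype listed earlier.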

The XML representation would be:
<!ENTITY LOCATION CDATA "URL" -- Indexed Document -->
<AUTHOR>Elmer Fudd</AUTHOR>
<COMPANY>Acme Inc.</COMPANY>
<ADDRESS>Looney Place 4, Acme Acres</ADDRESS>
<VERSION>@(#)HTMLMETA.i.html 1.11 11/27/98 23:19:35 BSN</VERSION>

An entity notation has been chosen instead of:
<LOCATION>http://www.your.org/blah</LOCATION>
         or
<LOCATION HREF="http://www.your.org/blah"/>
so as not to break emerging DTDs.

Note: The name for the LOCATION as well as the XML notation is subject to change in future versions.

In CONTENT, if a (Keyword=YYY) immediately follows the opening '"', it is specially processed. Currently there are handlers for two keywords: Schema and Type. If the keyword is not in this set of registered words (with handlers) it is considered part of the content, eg.
<META NAME="AUTHOR" CONTENT="(Fudd)"> is read as <AUTHOR>(Fudd)</AUTHOR>.
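
A minimal sketch of this rule, assuming the keyword is matched case-insensitively and that only a single leading (Keyword=Value) group is examined. The function name split_content and the REGISTERED set are illustrative and not part of the doctype.

```python
import re

# The two keywords that currently have handlers.
REGISTERED = {"SCHEMA", "TYPE"}

# A leading "(Keyword=Value)" group at the very start of CONTENT.
_PREFIX = re.compile(r"^\((\w+)=([^)]*)\)")


def split_content(content):
    """Return (keyword, value, rest); keyword is None if nothing is registered."""
    match = _PREFIX.match(content)
    if match and match.group(1).upper() in REGISTERED:
        return match.group(1).upper(), match.group(2), content[match.end():]
    # Unregistered or absent keyword: everything stays part of the content.
    return None, None, content
```

With this sketch "(Schema=ISO)1996-04-19" splits into the keyword SCHEMA, the value ISO and the content 1996-04-19, while "(Fudd)" is returned unchanged.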

Specifying Complex Attributes

To specify a schema (for container data typing) one includes it in CONTENT as
<META NAME=Attribute CONTENT="(Schema=Type)Content">
The (more intuitive) alternative: <meta name="Attribute(Type)" content="Content"> is not valid HTML given the '('. It is, nonetheless, in the HTML tradition, accepted.

In the presentation both are mapped to: <Attribute Schema="Type">
Example: <META NAME="DATE" CONTENT="(Schema=ISO)1996-04-19">

would produce the XML: <DATE SCHEMA="ISO">1996-04-19</DATE>

In the above example the content data would be stored under the field DATE(ISO).

Since the draft Cougar DTD from the W3C defines:

<!ATTLIST META
  %i18n;                      -- lang, dir for use with content string --
  http-equiv  NAME  #IMPLIED  -- HTTP response header name  --
  name        NAME  #IMPLIED  -- metainformation name --
  content     CDATA #REQUIRED -- associated information --
  scheme      CDATA #IMPLIED  -- select form of content --
  >
HTMLMETA interprets the following as equivalent:
  1. <META NAME="DATE(ISO)" CONTENT="1996-04-19">
  2. <META NAME="DATE" CONTENT="(SCHEMA=ISO)1996-04-19">
  3. <META NAME="DATE" SCHEME="ISO" CONTENT="1996-04-19">
The Cougar Model is just a draft and is not stable. It is subject to change. HTMLMETA will track the changes to reflect the emerging DTD.
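
The equivalence of the three spellings can be illustrated with a small normalizer. This is a sketch under assumptions: the function names normalize and to_xml are invented for the example, and what happens when several spellings disagree within one tag is not specified here, so the sketch simply lets the last one examined win.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def normalize(name, content, scheme=None):
    """Reduce any of the three spellings to (field, schema, content)."""
    # Form 1: NAME="DATE(ISO)"
    match = re.match(r"^(.*)\(([^)]+)\)$", name)
    if match:
        name, scheme = match.group(1), match.group(2)
    # Form 2: CONTENT="(SCHEMA=ISO)..."
    match = re.match(r"^\(schema=([^)]*)\)", content, re.IGNORECASE)
    if match:
        scheme, content = match.group(1), content[match.end():]
    # Form 3: SCHEME="ISO" arrives through the scheme argument.
    return name, scheme, content


def to_xml(name, content, scheme=None):
    """Render the normalized field as an XML element."""
    field, schema, text = normalize(name, content, scheme)
    attribute = " SCHEMA=%s" % quoteattr(schema) if schema else ""
    return "<%s%s>%s</%s>" % (field, attribute, escape(text), field)
```

All three forms listed above then yield <DATE SCHEMA="ISO">1996-04-19</DATE>.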

Marking Up Hierarchical Data

To represent hierarchical data the HTMLMETA framework uses the '.' (dot-notation), as
<META NAME="Tag0.Tag1..." CONTENT="Content"> eg.
<META NAME="AUTHOR.NAME" CONTENT="Elmer Fudd">
<META NAME="AUTHOR.AGE" CONTENT="60">

Produces:
<AUTHOR><NAME>Elmer Fudd</NAME><AGE>60</AGE></AUTHOR>

By extending to several levels of '.' one can define a simple hierarchical model of meta-data suitable for the development of sophisticated models and services.

The order of the METAs is significant. To markup:
  <TITLE>A Title
   <NAME>A Name</NAME>
  </TITLE>

One enters:
 <META NAME="TITLE" CONTENT="A Title">
 <META NAME="TITLE.NAME" CONTENT="A Name">

By contrast, the META markup:
 <META NAME="TITLE" CONTENT="A Title">
 <META NAME="NAME" CONTENT="A Name">
 <META NAME="TITLE.NAME" CONTENT="Another Name">

Would produce:
  <TITLE>A Title</TITLE>
  <NAME>A Name</NAME>
  <TITLE><NAME>Another Name</NAME></TITLE>
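
One way to reproduce the order-dependent nesting shown in these examples is to keep a stack of open elements and reuse only the ancestors that the next dotted name shares with it. The following Python sketch is inferred from the examples above and is not taken from the HTMLMETA source; the function name metas_to_xml is illustrative.

```python
from xml.sax.saxutils import escape


def metas_to_xml(metas):
    """Nest dotted META names into XML, honouring document order."""
    out = []        # output fragments
    open_tags = []  # stack of currently open element names
    for name, content in metas:
        path = name.split(".")
        # Count the already-open ancestors this path can reuse.
        # The last path component is always opened as a new element.
        keep = 0
        limit = min(len(open_tags), len(path) - 1)
        while keep < limit and open_tags[keep] == path[keep]:
            keep += 1
        # Close everything below the shared ancestors.
        while len(open_tags) > keep:
            out.append("</%s>" % open_tags.pop())
        # Open the remaining elements, then emit the content.
        for tag in path[keep:]:
            out.append("<%s>" % tag)
            open_tags.append(tag)
        out.append(escape(content))
    # Close whatever is still open at the end of the record.
    while open_tags:
        out.append("</%s>" % open_tags.pop())
    return "".join(out)
```

Because only adjacent METAs can share open ancestors, an intervening <META NAME="NAME" ...> closes TITLE, and a later TITLE.NAME starts a new TITLE element, as in the last example above.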

The convention is somewhat different from the one proposed at W3C Distributed Indexing and Searching Workshop, May 28-29, 1996:

<META NAME= "schema_identifier.element_name"
           CONTENT="string data">
In general HTML documents marked-up with this convention will be correctly interpreted. The major difference between it and the semantics of the HTMLMETA framework occurs when the element_name includes '.', as in <META NAME="GNU.FOO.BAR" CONTENT="Aardvarks">:

  HTMLMETA framework
      field BAR as a sub-element of FOO under GNU.
  W3C Workshop framework
      field FOO.BAR within the named schema GNU. This can be viewed as FOO.BAR under GNU. The order of the META tags is invariant.
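
The two readings of the same NAME value differ only in where the name is split. A short Python illustration follows; both function names are invented for the example.

```python
def htmlmeta_path(name):
    """HTMLMETA reading: every '.' opens a further level of nesting."""
    return name.split(".")


def workshop_split(name):
    """W3C Workshop reading: schema identifier, then one element name."""
    schema, _, element = name.partition(".")
    return schema, element
```

For "GNU.FOO.BAR" the first function yields the three nested levels GNU, FOO and BAR, while the second yields the schema GNU and the single element name FOO.BAR.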

Another common model, typical among Dublin Core implementations, is:
   <META NAME="schema_identifier.element_name" CONTENT="(Type=element_type)Content">. In the HTMLMETA framework it would be interpreted as:
   <META NAME="schema_identifier.element_name.element_type" CONTENT="Content">

The advantages of the HTMLMETA framework are:

Although many document/data models can be "flattened", especially for S/R services, a hierarchical model can provide advantages. Examine, for example, the following GILS attributes:

<!ELEMENT Contact-Name               - O (#PCDATA)>
<!ELEMENT Contact-Organization       - O (#PCDATA)>
<!ELEMENT Contact-Street-Address     - O (#PCDATA)>
<!ELEMENT Contact-City               - O (#PCDATA)>
<!ELEMENT Distributor-Name           - O (#PCDATA)>
<!ELEMENT Distributor-Organization   - O (#PCDATA)>
<!ELEMENT Distributor-Street-Address - O (#PCDATA)>
<!ELEMENT Distributor-City           - O (#PCDATA)>
Using something like this, instead:
<!ELEMENT Name           - O (#PCDATA)>
<!ELEMENT Organization   - O (#PCDATA)>
<!ELEMENT Street-Address - O (#PCDATA)>
<!ELEMENT City           - O (#PCDATA)>
<!ENTITY % address "NAME & ORGANIZATION & STREET-ADDRESS & CITY">
<!ELEMENT CONTACT O O  (%address)>
<!ELEMENT DISTRIBUTOR O O  (%address)>
has the advantage that the kind of content is more clearly defined as an object.

One would then mark up as:
<META NAME="Distributor.City" CONTENT="Munich, Germany">

Mapping to a hierarchical DTD with ORGANIZATION, STREET-ADDRESS and CITY as subelements of CONTACT and DISTRIBUTOR allows for an implicit and interoperable search for ORGANIZATION which would include a search of both CONTACT and DISTRIBUTOR.
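
The benefit for searching can be shown with the standard xml.etree.ElementTree module. The sketch below is not how the search engine itself works; the RECORD sample, its enclosing RECORD element and the function name find_field are hypothetical and exist only for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical record following the hierarchical model sketched above.
RECORD = """<RECORD>
  <CONTACT>
    <NAME>Elmer Fudd</NAME>
    <ORGANIZATION>Acme Inc.</ORGANIZATION>
    <CITY>Acme Acres</CITY>
  </CONTACT>
  <DISTRIBUTOR>
    <NAME>Wile E. Coyote</NAME>
    <ORGANIZATION>Desert Supplies</ORGANIZATION>
    <CITY>Munich, Germany</CITY>
  </DISTRIBUTOR>
</RECORD>"""


def find_field(xml_text, field, within=None):
    """Return the text of every `field`, optionally only below `within`."""
    root = ET.fromstring(xml_text)
    scopes = root.iter(within) if within else [root]
    return [element.text for scope in scopes for element in scope.iter(field)]
```

A search for ORGANIZATION without a scope covers both CONTACT and DISTRIBUTOR, while passing within="DISTRIBUTOR" narrows it to one branch. With the flattened attribute names each variant would have to be queried separately.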

Real world example

The following example illustrates the use of Dublin Core Metadata in HTML.

<!-- Ripped out from http://info.ox.ac.uk/~lou/wip/metadata.syntax.html -->
<meta name="title" content="A syntax for Dublin core Metadata:
 Recommendations from the second Metadata Workshop">
<meta name="author" content="Lou Burnard">
<meta name="author" content="Eric Miller">
<meta name="author" content="Liam Quin">
<meta name="author" content="C. M. Sperberg-McQueen">
<meta name="subject" content="metadata">
<meta name="subject" content="Second Metadata Workshop (Warwick, U.K.)">
<meta name="date" content="1996">
<meta name="date" content="(Schema=ISO)1996-04-19">
<meta name="object-type" content="article">
<meta name="form" content="HTML 2.0">
<meta name="form" content="(Schema=IMT)text/html">
<meta name="identifier"
      content="(Schema=URL)http://info.ox.ac.uk/~lou/wip/metadata.syntax.html">
<meta name="identifier"
      content="(Schema=URL)http://www.uic.edu/~cmsmcq/tech/metadata.syntax.html">
<meta name="source" content="(none)">
<meta name="language" content="(Schema=ISO639)en">

The Meta-record fragment would be:

<!ENTITY LOCATION  CDATA "http://info.ox.ac.uk/~lou/wip/metadata.syntax.html">
<author>Lou Burnard</author>
<author>Eric Miller</author>
<author>Liam Quin</author>
<author>C. M. Sperberg-McQueen</author>
<subject>metadata</subject>
<subject>Second Metadata Workshop (Warwick, U.K.)</subject>
<date>1996</date>
<date SCHEMA="ISO">1996-04-19</date>
<object-type>article</object-type>
<form>HTML 2.0</form>
<form SCHEMA="IMT">text/html</form>
<identifier SCHEMA="URL">
	http://info.ox.ac.uk/~lou/wip/metadata.syntax.html</identifier>
<identifier SCHEMA="URL">
	http://www.uic.edu/~cmsmcq/tech/metadata.syntax.html</identifier>
<source>(none)</source>
<language SCHEMA="ISO639">en</language>

As with all smart doctypes the HTML presentation includes an autodetection of implied hyperlinks, URLs and email addresses.

Note: The LOCATION refers to the derived URL of the document being indexed and this might be different from the encoded URL identifier. The latter is the more definitive source and can well have a different encoded meta-record (different version).

Meta Level Grain

The HTML presentation, in turn, also contains Metadata and link information about the resource record derived from the <LINK...> content in the original resource.

The <LINK REL="BASE" HREF=URL> encodes the link from the meta record to the original HTML that contained the meta.

HTTP-EQUIV metadata for DATE-MODIFIED and EXPIRATION are transferred from the original resource to the meta-record. Rating system information such as PICS is also reflected in the META record.
<META http-equiv="PICS-Label" content='(PICS-1.0 "http://www.classify.org/safesurf/" l on "1997.05.02T13:23-0100 r (SS~~000 1)'> is a simple PICS classification for an HTML resource. While this PICS label refers to the resource and might not apply to the meta-record, it is still transferred.

The <HEAD> would contain:
<META http-equiv="PICS-Label" content='(PICS-1.0 "http://www.classify.org/safesurf/" l on "1997.05.02T13:23-0100 r (SS~~000 1)'>
And the content would contain:

PICS-Label: (PICS-1.0 "http://www.classify.org/safesurf/" l on "1997.05.02T13:23-0100 r (SS~~000 1)

or as XML:
<PICS-Label>(PICS-1.0 "http://www.classify.org/safesurf/" l on "1997.05.02T13:23-0100 r (SS~~000 1)</PICS-Label>

The LINK attributes are transferred to the HTML and mapped into the XML, from
<LINK Type=Name TITLE=Title HREF=URL>
to <Type NAME=Name TITLE=Title HREF=URL/>, where Type is, eg., REV or REL.
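
As a rough Python sketch of this mapping, assuming the LINK attributes have already been parsed into a dictionary with upper-case keys. The function name link_to_xml is illustrative, and attributes other than TITLE and HREF are ignored here.

```python
from xml.sax.saxutils import quoteattr


def link_to_xml(attrs):
    """Map <LINK REL|REV=Name TITLE=.. HREF=..> to <REL|REV NAME=Name .../>."""
    for kind in ("REL", "REV"):
        if kind in attrs:
            # The link type becomes the element name, its value the NAME.
            parts = ["NAME=%s" % quoteattr(attrs[kind])]
            for key in ("TITLE", "HREF"):
                if key in attrs:
                    parts.append("%s=%s" % (key, quoteattr(attrs[key])))
            return "<%s %s/>" % (kind, " ".join(parts))
    return None  # neither REL nor REV present
```

For example, a LINK with REL="SCHEMA.dc", TITLE="Dublin Core" and an HREF becomes a REL element carrying NAME, TITLE and HREF attributes.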

The <LINK REL="META" SRC="..."> and <LINK REV="META" SRC="..."> are special! They are used to provide external meta-record definitions to an HTML resource. There are no definitive standards yet.

The <LINK REL="SCHEMA.XXX" HREF="..."> and <LINK REL="DTD.XXX" HREF="...">
are used to define the descriptive schema and the DTD for the structure.

Let's look at a more complicated, "real-world", example:
<!-- Dublin Core Metadata Package -->
<META NAME="DC.title" CONTENT="Proposed Encodings for Dublin Core Metadata">
<META NAME="DC.author.name" CONTENT="Dave Beckett">
<META NAME="DC.author.email" CONTENT="D.J.Beckett@ukc.ac.uk">
<META NAME="DC.subject.keyword" CONTENT="metadata, dublin core">
<META NAME="DC.identifier.URL" CONTENT="http://www.hensa.ac.uk/pub/metadata/dc-encoding.html">
<META NAME="DC.form.imt" CONTENT="text/html">
<META NAME="DC.language" CONTENT="en">
<LINK REL="SCHEMA.dc" TITLE="Dublin Core" HREF="http://purl.org/metadata/dublin_core_elements">
<LINK REL="DTD.dc" TITLE="-//OCLC//DTD Dublin Core 1.0//EN" HREF="http://purl.org/metadata/dublin.dtd">
The LINK specifies the schema used for the METAs, which is a kind of meta-DTD model about the structure of the meta-record. The DTD.dc refers to the DTD for the model, so one has a record such as:

<?XML ENCODING="Charset (IANA Name)" VERSION="1.0"?>
<!DOCTYPE DC PUBLIC "-//OCLC//DTD Dublin Core 1.0//EN" "http://purl.org/metadata/dublin.dtd">
<!ENTITY LOCATION  CDATA "http://www.bsn.com/blah/blah/blah">
<DC NAME="Dublin Core" SCHEMA="http://purl.org/metadata/dublin_core_elements">
<TITLE>Proposed Encodings for Dublin Core Metadata</TITLE>
<AUTHOR>
  <NAME>Dave Beckett</NAME>
  <EMAIL>D.J.Beckett@ukc.ac.uk</EMAIL>
</AUTHOR>
<SUBJECT>
  <KEYWORD>metadata, dublin core</KEYWORD>
</SUBJECT>
<IDENTIFIER>
  <URL>http://www.hensa.ac.uk/pub/metadata/dc-encoding.html</URL>
</IDENTIFIER>
<FORM>
 <IMT>text/html</IMT>
</FORM>
<LANGUAGE>en</LANGUAGE>
</DC>

Since meta-data markup conventions are in flux and the HTML should survive past the next change, the doctype option MAPPING lets one specify a map file that maps the container specified in <META NAME=..> to an alternative name. This feature has been provided to allow for the unification of field names between collections and doctypes, eg. to interpret <META NAME="DC.form.imt" CONTENT="text/html"> and <META NAME="DC.form" CONTENT="(Schema=imt)text/html"> as the same field.
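
The idea behind such a mapping can be sketched as a simple lookup table. The syntax of the real map file is not described here, so the in-memory FIELD_MAP dictionary, its key format and the function name unify are assumptions made purely for illustration.

```python
# Hypothetical mapping: source container name -> unified field name.
# A schema-qualified container is keyed as "NAME(SCHEMA)".
FIELD_MAP = {
    "DC.form(imt)": "DC.form.imt",
}


def unify(name, schema=None, mapping=FIELD_MAP):
    """Return the unified field name for a META container."""
    key = "%s(%s)" % (name, schema) if schema else name
    # Names without an entry pass through unchanged.
    return mapping.get(key, key)
```

Both <META NAME="DC.form.imt" ...> and <META NAME="DC.form" CONTENT="(Schema=imt)..."> then end up under the single field name DC.form.imt.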

The XML produced is NOT VALIDATED but is assumed to have been correct. It is the duty of the HTML authors to confirm the correctness of the mark-up.

If <LINK REL="DTD.dc" ...> is not defined then the XML record produced would be without the doctype declaration.

Experimental XSpace Meta Content Framework (MCF)

In addition the HTML presentation contains a:
<EMBED SRC="URL.mcf">
as a link to a MCFified view of the hyper-content of the original document. This dynamic reference is a catalog of the links from the document external to the document and meta-record.

The MCF file format is from Apple Research's ProjectX. It was selected due to the availability of plug-ins for popular browsers on the Windows and Mac (68K and PPC) platforms. Since the MCF file format is trivial to parse and Apple is not in Redmond, a port to other platforms (including Java) is not unreasonable.

Example: The MCF file for this file: HTMLMETA.i.mcf.

The MIME type for the MCF is: Content-type: image/vasa

Together with a MCFed Web one has the possibility to create a VRML-like geometric navigation model. It is easily reproduced from a HyTime model.

Note: The support in HTMLMETA for MCF is experimental and subject to change.

Producing META-record content

At BSn/IBU's publishing services we have developed several customer Webs on the basis of database driven content. Since the HTML is produced by computer programs from a database, the META mark-up is always up-to-date with the content.

While Relational database models are poorly suited to search and retrieval services they are appropriate for the management of content, eg. as the basis for editorial systems. Full blown RDBMS products are not always required. For many applications where a single author creates record content, one of the simple low-cost personal database packages is sufficient. Even with shared authoring of content an RDBMS is not always warranted. We have developed web-centered workflow systems around wxPython and the interBasis WebCat. In other projects where the goal was a Web and not a database target we have exported from Claris FileMaker Pro and MS Access databases maintained by the customer on their desktop to colon-separated lists and from there to COLONDOC, XML, or (for print) Framemaker's MIF with a set of script driven programs.

These files are then converted by a template driven process into HTML. The advantage of this production process, aside from substantially reducing the costs of creating and maintaining a WWW presence for our customers, is to de-couple the HTML design and corporate identity from the content, reduce hyperlink errors and allow for the management of meta-views.
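
A toy version of such a template driven step is sketched below in Python. It assumes one "FIELD: value" pair per line of the exported record and uses string.Template from the standard library as a stand-in for the real template system; the PAGE template and the function names parse_record and render are invented for the example.

```python
from html import escape
from string import Template

# Hypothetical page template; design and content stay separate.
PAGE = Template(
    "<HTML><HEAD>\n<TITLE>$title</TITLE>\n$metas\n</HEAD>\n"
    "<BODY>$body</BODY></HTML>"
)


def parse_record(text):
    """Split 'FIELD: value' lines into (field, value) pairs."""
    fields = []
    for line in text.splitlines():
        if ":" in line:
            name, value = line.split(":", 1)
            fields.append((name.strip(), value.strip()))
    return fields


def render(text, body=""):
    """Fill the page template with META tags generated from the record."""
    fields = parse_record(text)
    metas = "\n".join(
        '<META NAME="%s" CONTENT="%s">'
        % (escape(name, quote=True), escape(value, quote=True))
        for name, value in fields
    )
    title = escape(dict(fields).get("TITLE", ""))
    return PAGE.substitute(title=title, metas=metas, body=body)
```

Since the META tags are generated from the same record as the visible page, they cannot drift out of date with the content, and changing the look only requires a new template.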

Since the data is designed around a hierarchical structured object-data model, HyTime, SGML, XML (to any of a number of DTDs), HTML and other structured data formats such as MARC are all easily implemented. This model enables the encoding of META records, push content delivery, CDF and the emerging use of SGML in the Internet. For a WWW-presence the target of the conversion is often HTML.

Since one often wants to change the presentation and allow the evolution of new formats, the HTML conversion is driven by templates similar to the master pages offered by publishing packages such as Framemaker®. The use of Master Pages allows one to alter the look and feel as well as corporate identity without substantial cost.

With manual page generation it is somewhat more complicated but many templates exist to help automate the procedure. BSn has also developed some tools to merge meta-data with HTML.

The point of embedding the meta data record into the HTML source is that it is scalable from simplistic low cost solutions up to full blown sophisticated editorial systems. The issue is pragmatics and encoding the meta record in HTML holds the chance of being widely accepted.

From the viewpoint of gathering meta-records there is little difference (other than cost and data consistency) between human coded and machine produced copy.

Additional Information

See also: ROADS (IAFA), IAFADOC, FTP and GILS doctypes.

The MIME type for Raw is "text/xml"

In preparation:


© Copyright 1995-1997   Basis Systeme netzwerk, Munich. All Rights Reserved.