Class hierarchy: Doctype Classes / Doctype (Generic Document Type) / AUTODETECT
AUTODETECT
AUTODETECT is a special kind of doctype that really isn't a doctype at all. Although, from the viewpoint of C++, it is installed as a doctype in dtreg, the doctype registry (see src/dtreg.cxx), it does not handle parsing or presentation and serves only to map to other doctypes. It uses a combination of file content and file extension analysis to determine the suitable doctype for processing the file.
The identification algorithms and inference logic have been designed to be smart enough to provide relatively fine-grained identification. The analysis is based in large part on content analysis, in contrast to magic or file extension methods. The latter are used as hints and not for the identification itself. These techniques allow the autodetector to distinguish between several very similar doctypes (for example MEDLINE, FILMLINE and DVBLINE), so one can index whole directory trees without having to worry about selecting a doctype.
It can also detect many formats for which no suitable doctype class is currently available, as well as binary files that are probably not intended for indexing (these include miscellaneous files from FrameMaker, SoftQuad Author/Editor, TeX/METAFONT, core files, binaries, etc.). At present ALL available doctypes are identified.
Some doctypes handle the same document format but serve different functions (e.g. HTML, HTML-- and HTMLMETA). Since being smart and having ESP are quite different traits, the autodetector cannot guess which function is intended: one must specify the document parser explicitly, or the most general default parser is chosen (e.g. HTML for the entire class of HTML files).
Should the internal logic not recognize the document format, the autodetector then appeals to a user-editable magic file for identification. If the type is identified as some form of "text", viz. not as some binary or other format, then the file is associated with the PLAINTEXT doctype.
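The fallback order described above can be sketched as follows. This is only an illustration in Python, not the C++ code of the autodetector: the names `detect_doctype`, `content_rules` and `magic_lookup` are hypothetical, the toy rules are far cruder than the real content analysis, and returning `None` for unrecognized non-text files is an assumption, since the text does not say what happens in that case.

```python
from typing import Callable, Optional, Sequence, Tuple

# A content rule pairs a doctype name with a predicate over the file bytes.
# (Hypothetical structure, used only for this illustration.)
ContentRule = Tuple[str, Callable[[bytes], bool]]


def detect_doctype(
    data: bytes,
    content_rules: Sequence[ContentRule],
    magic_lookup: Callable[[bytes], Optional[str]],
) -> Optional[str]:
    """Return a doctype name, or None if the file is not identified."""
    # Step 1: internal content analysis is tried first.
    for doctype, matches in content_rules:
        if matches(data):
            return doctype
    # Step 2: fall back to the (user-editable) magic description.
    description = magic_lookup(data)
    # Step 3: anything described as some form of "text" becomes PLAINTEXT.
    if description is not None and "text" in description.lower():
        return "PLAINTEXT"
    # Assumption: other formats are left unidentified.
    return None


# Toy stand-ins for the real content analysis and the magic file.
rules = [
    ("HTML", lambda d: d.lstrip().lower().startswith((b"<!doctype html", b"<html"))),
    ("XML", lambda d: d.lstrip().startswith(b"<?xml")),
]


def toy_magic(data: bytes) -> Optional[str]:
    # Crude stand-in: pure ASCII is reported as text, anything else as data.
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return "data"
    return "ASCII text"


print(detect_doctype(b"just some notes", rules, toy_magic))
```

The design point the sketch tries to capture is the ordering: content rules win, the magic description is consulted only when they fail, and PLAINTEXT is the catch-all for anything that looks like text.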
Since it has proved accurate, robust and comfortable, it is the default doctype for the BSn indexer.
AUTODETECT maps to other doctypes: after analysis and identification, the parsing and the doctype name are altered to reflect the file's real doctype.
| Extension | Doctype Hint |
|---|---|
| .text .txt .TXT | "PLAINTEXT" |
| .medline .med .MED | "MEDLINE" |
| .film .flm .FLM | "FILMLINE" |
| .dvb .DVB | "DVBLINE" |
| .bibtex .BTX | "BIBTEX" |
| .refer .REF | "REFERBIB" |
| .html .htm .HTM | "HTML" |
| .gif .GIF | If not BINARY or FTP 1), then IMAGE (GIF) |
| .gils .gil .GIL | "GILS" |
| .sgml .sgm .SGM | "SGMLNORM" |
| .xml .XML | "XML" |
| .iafa .IAF | "IAFADOC" |
| .dif .DIF | "DIF" |
| .roads | "ROADSDOC" |
| .tif .tiff | If not BINARY or FTP 1), then IMAGE (TIFF) |
| .whois++ .WHO | "IKNOWDOC" |
The table above is NOT up to date; it is only a sample.
1) If a companion ".info" file exists, use the BINARY doctype; if a companion ".iafa" file (the IAFA description file) exists, use the FTP doctype.
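The footnote rule for image files might look like the sketch below. Several details are assumptions made for the illustration and are not stated in the text: that the companion files carry `.info` and `.iafa` suffixes appended to the full file name (e.g. `logo.gif.info`), that the `.info` check takes precedence when both exist, and the hypothetical function names themselves.

```python
from pathlib import Path


def companion(path: Path, suffix: str) -> Path:
    # Assumption: the companion name is the full file name plus the suffix.
    return path.with_name(path.name + suffix)


def resolve_image_doctype(path: Path, image_kind: str) -> str:
    """Apply the footnote rule to a file whose extension hints at an image."""
    if companion(path, ".info").exists():
        return "BINARY"  # described by a companion .info file
    if companion(path, ".iafa").exists():
        return "FTP"  # described by a companion IAFA description file
    return f"IMAGE ({image_kind})"


print(resolve_image_doctype(Path("no-such-file.gif"), "GIF"))
```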
Note: A matching file name extension is treated as a hint, not as a 100% identification. The content is still often checked to confirm that the "guess" is reasonable.
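To make the "hint, then confirm" idea concrete, here is a small Python sketch. It is not the autodetector's actual logic: the hint map is an excerpt of the sample table, lower-casing the extension is a simplification (the table lists specific spellings), and `doctype_from_hint` and the toy `checks` are hypothetical. A hint with no registered check is accepted as-is, reflecting that the content is "often", not always, checked.

```python
import os
from typing import Callable, Dict, Optional

# Excerpt of the sample table; extensions are lower-cased for simplicity.
EXTENSION_HINTS: Dict[str, str] = {
    ".txt": "PLAINTEXT",
    ".text": "PLAINTEXT",
    ".med": "MEDLINE",
    ".medline": "MEDLINE",
    ".htm": "HTML",
    ".html": "HTML",
    ".xml": "XML",
}


def doctype_from_hint(
    filename: str,
    data: bytes,
    checks: Dict[str, Callable[[bytes], bool]],
) -> Optional[str]:
    """Return the hinted doctype if the content does not contradict it."""
    ext = os.path.splitext(filename)[1].lower()
    hint = EXTENSION_HINTS.get(ext)
    if hint is None:
        return None  # no hint: leave the decision to content analysis
    check = checks.get(hint)
    if check is None or check(data):
        return hint  # hint accepted (unchecked or confirmed)
    return None  # hint rejected: the content does not look like the hinted type


# Toy plausibility checks, much cruder than real content analysis.
checks = {
    "HTML": lambda d: b"<html" in d.lower(),
    "XML": lambda d: d.lstrip().startswith(b"<"),
}

print(doctype_from_hint("page.HTM", b"<html></html>", checks))
```

A rejected hint returns `None` here so that the caller can fall back to full content analysis instead of trusting a misleading extension.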