README.parser
=============

This document gives a short explanation of the various parsing stages
within the KHTML component.

When HTML is fed into KHTML, it passes through three stages before it is
put onto the screen:

Stage 1: The Tokenizer
Stage 2: The HTML-Parser
Stage 3: The HTML-Layout

The Tokenizer
=============

The tokenizer is located in khtmltokenizer.cpp. The tokenizer takes the
contents of an HTML-file as input and breaks them up into a linked list of
tokens. The tokenizer recognizes HTML-entities and HTML-tags. Text between
begin- and end-tags is handled differently for several tags. The differences
lie in how spaces, linefeeds, HTML-entities and other tags are handled.

Example I:
Normally linefeeds are treated like spaces. However, inside a <pre> tag
linefeeds are preserved.

Example II:
Normally all text is translated into tokens and added to the linked list to
be fed into the HTML-Parser. However, text within the <script> tag is fed
into a script-interpreter (not that one is available at the moment).

The tokenizer is completely state-driven on a character-by-character basis.
All text passed to the tokenizer is tokenized immediately. A complete
HTML-file can be passed to the tokenizer as a whole, character by character
(not very efficient), or in blocks of any (variable) size.
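
The state-driven, chunk-independent behaviour described above can be
sketched as follows. This is only an illustration and not the code in
khtmltokenizer.cpp: the names SketchTokenizer and Token are hypothetical,
only tags and plain text are recognized, and a std::vector stands in for
the linked list of tokens.

```cpp
#include <string>
#include <vector>

// Hypothetical token type; not the real KHTML token structure.
struct Token {
    enum Kind { Text, Tag } kind;
    std::string data;
};

// Minimal state-driven tokenizer. The state and the partial token are kept
// between write() calls, so the input may be split at any character.
class SketchTokenizer {
public:
    // Feed a block of any size; every character is processed immediately.
    void write(const std::string &chunk) {
        for (char c : chunk)
            step(c);
    }

    // Signal the end of the input so that trailing text is emitted.
    void end() {
        if (state_ == InText)
            flush(Token::Text);
    }

    const std::vector<Token> &tokens() const { return tokens_; }

private:
    enum State { InText, InTag };

    // Advance the state machine by exactly one character.
    void step(char c) {
        if (state_ == InText) {
            if (c == '<') {
                flush(Token::Text);
                state_ = InTag;
            } else {
                buffer_ += c;
            }
        } else { // InTag
            if (c == '>') {
                flush(Token::Tag);
                state_ = InText;
            } else {
                buffer_ += c;
            }
        }
    }

    // Emit the buffered characters as one token (empty buffers are skipped).
    void flush(Token::Kind kind) {
        if (!buffer_.empty())
            tokens_.push_back(Token{kind, buffer_});
        buffer_.clear();
    }

    State state_ = InText;
    std::string buffer_;
    std::vector<Token> tokens_;
};
```

Because the state is stored in the object, writing "<b>Hi</b>" in one call
or in pieces such as "<", "b>H" and "i</b>" yields the same tokens.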


The HTML-Parser
===============

The HTML-parser interprets the stream of tokens provided by the tokenizer
and constructs a tree of elements representing the document according
to the Document Object Model (DOM, see http://www.w3.org). For HTML,
one can distinguish between three kinds of basic objects the document is
built up from:

* Text

Text is a basic class holding some text of the page.

* HTMLBlockElement

Elements representing a block in the document (like <hr>, <table>,
<blockquote>, <li>, ...). These elements can contain inline elements
(the ones forming paragraphs) and other block elements. Block elements
have the ability to render themselves and the inline elements that
they contain.

* HTMLInlineElement

Inline elements are all elements that are rendered as part of a
paragraph (e.g. <b>, <img>, <tt>, ...). Inline elements do not render
themselves, but are rendered by the surrounding block element. Inline
elements can't contain any block elements.

The root of all elements is the HTMLDocument.
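
The containment rules above can be sketched with a small tree type. This
is a simplified illustration, not the KHTML class hierarchy: SketchNode,
NodeKind and append() are hypothetical names, and the rule that inline
elements may contain other inline elements and text is an assumption that
the description above does not state explicitly.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// The three kinds of objects described above, plus the document root.
enum class NodeKind { Document, Block, Inline, Text };

struct SketchNode {
    NodeKind kind;
    std::string name; // tag name, or the text itself for Text nodes
    std::vector<std::unique_ptr<SketchNode>> children;

    SketchNode(NodeKind k, std::string n) : kind(k), name(std::move(n)) {}

    // Containment rules: blocks (and the document) accept anything except a
    // document, inline elements never accept blocks, text has no children.
    bool canContain(NodeKind child) const {
        switch (kind) {
        case NodeKind::Document:
        case NodeKind::Block:
            return child != NodeKind::Document;
        case NodeKind::Inline:
            return child == NodeKind::Inline || child == NodeKind::Text;
        case NodeKind::Text:
            return false;
        }
        return false;
    }

    // Appends a child and returns it, or returns nullptr (leaving the tree
    // unchanged) if the containment rules forbid it.
    SketchNode *append(NodeKind childKind, std::string childName) {
        if (!canContain(childKind))
            return nullptr;
        children.push_back(std::unique_ptr<SketchNode>(
            new SketchNode(childKind, std::move(childName))));
        return children.back().get();
    }
};
```

With this sketch, appending a <table> block to a <b> inline element is
rejected, while appending text to it succeeds.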

The HTML-Layout
===============

When the complete structure of elements and text is built, the
HTML-layout starts: each HTMLElement is positioned. The positioning depends
on the available screen-width.

### This might still change

The positioning starts with the calculation of the minimum screen-width
required to display the complete HTML page. The calcMinSize method in
HTML-clues and HTML-objects is used for this. The minimum size is calculated
recursively through all HTML-clues.

When the minimum size is known, it is compared against the actual available
screen-size. If the minimum size is less than the available screen-size,
the available screen-size is used as the maximum screen size. If the
minimum size is greater than the available size, the minimum size is used
as the maximum screen size. In that case, if configured, a horizontal
scrollbar is added so that the page can be scrolled.
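
The comparison described above amounts to taking the larger of the two
widths. A minimal sketch, assuming widths are plain integers; the names
LayoutWidth and chooseLayoutWidth are hypothetical and do not appear in
KHTML:

```cpp
#include <algorithm>

// Result of the width decision (hypothetical type for illustration).
struct LayoutWidth {
    int maxWidth;             // width the page is laid out in
    bool horizontalScrollbar; // true if the page is wider than the screen
};

// minWidth:          minimum width needed to display the whole page
// availableWidth:    width that is actually available on screen
// scrollbarsEnabled: whether a horizontal scrollbar is configured
LayoutWidth chooseLayoutWidth(int minWidth, int availableWidth,
                              bool scrollbarsEnabled) {
    LayoutWidth result;
    // The page never gets less than its minimum width.
    result.maxWidth = std::max(minWidth, availableWidth);
    // A scrollbar is only needed when the page does not fit.
    result.horizontalScrollbar =
        scrollbarsEnabled && minWidth > availableWidth;
    return result;
}
```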

-----------------------------------------------------------------------------
Advanced Topics
-----------------------------------------------------------------------------

DOM
===

khtml now uses DOM Level 1 (see http://www.w3.org for details) for holding
documents. Although the DOM implementation isn't finished yet, it's already
quite usable. The DOM is implemented as classes with automatic memory
management. We have internal classes (the *Impl classes) holding the data
of the DOM, but the programmer uses "pointer" classes to these internal ones.
The implementations hold a reference count of how many "pointer"/API instances
are pointing to them. Once the reference count drops to 0, the implementation
gets deleted.
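
The split between *Impl classes and "pointer" classes can be sketched as
follows. The names ElementImpl and ElementHandle are hypothetical and the
real khtml classes differ in detail; the sketch only shows how copying and
destroying the API objects drives the reference count.

```cpp
#include <string>
#include <utility>

// Internal class holding the data and the reference count.
class ElementImpl {
public:
    explicit ElementImpl(std::string name) : name_(std::move(name)) {}

    void ref() { ++refCount_; }
    // Deletes the implementation once nothing points to it any more.
    void deref() {
        if (--refCount_ == 0)
            delete this;
    }

    int refCount() const { return refCount_; }
    const std::string &name() const { return name_; }

private:
    int refCount_ = 0;
    std::string name_;
};

// "Pointer" class used by the programmer. Copies share one ElementImpl.
class ElementHandle {
public:
    explicit ElementHandle(ElementImpl *impl = nullptr) : impl_(impl) {
        if (impl_)
            impl_->ref();
    }
    ElementHandle(const ElementHandle &other) : impl_(other.impl_) {
        if (impl_)
            impl_->ref();
    }
    ElementHandle &operator=(const ElementHandle &other) {
        // Take the new reference first so self-assignment stays safe.
        if (other.impl_)
            other.impl_->ref();
        if (impl_)
            impl_->deref();
        impl_ = other.impl_;
        return *this;
    }
    ~ElementHandle() {
        if (impl_)
            impl_->deref();
    }

    bool isNull() const { return impl_ == nullptr; }
    std::string name() const { return impl_ ? impl_->name() : std::string(); }

private:
    ElementImpl *impl_;
};
```

The programmer only ever copies ElementHandle objects; when the last one
goes out of scope, the ElementImpl behind it deletes itself.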

The dom_* files implement the core DOM, the html_* files the HTML DOM.
As I focused on HTML, the classes in the core DOM that are used only for
XML are not implemented.

Paragraphs
==========

Every BlockElement goes through its children during layout. Once it
encounters an inline element, it starts a paragraph. All inline elements
(and their children) are scanned until it encounters the next BlockElement.
Text and inline elements are put together, and a line breaking algorithm
decides when to start a new line.
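
The two steps above can be sketched as follows. This is an illustration
only: Child, collectParagraphs and breakLines are hypothetical names,
widths are measured in characters rather than pixels, and the greedy
strategy is an assumption, since this document does not say which line
breaking algorithm khtml uses.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// A child of a block element, reduced to what the sketch needs.
struct Child {
    bool isBlock;     // true for block elements, false for inline/text
    std::string text; // text contributed by an inline child
};

// Step 1: consecutive inline children form one paragraph; a block child
// ends the paragraph that is currently open.
std::vector<std::string> collectParagraphs(const std::vector<Child> &children) {
    std::vector<std::string> paragraphs;
    std::string current;
    bool open = false;
    for (const Child &child : children) {
        if (child.isBlock) {
            if (open)
                paragraphs.push_back(current);
            current.clear();
            open = false;
        } else {
            current += child.text;
            open = true;
        }
    }
    if (open)
        paragraphs.push_back(current);
    return paragraphs;
}

// Step 2: greedy line breaking. A word that no longer fits on the current
// line starts a new one. Words longer than maxChars get a line of their own.
std::vector<std::string> breakLines(const std::string &paragraph,
                                    std::size_t maxChars) {
    std::vector<std::string> lines;
    std::string line;
    std::istringstream words(paragraph);
    std::string word;
    while (words >> word) {
        if (!line.empty() && line.size() + 1 + word.size() > maxChars) {
            lines.push_back(line);
            line.clear();
        }
        if (!line.empty())
            line += ' ';
        line += word;
    }
    if (!line.empty())
        lines.push_back(line);
    return lines;
}
```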

