Introduction to XML

Defining XHTML’s Abstract Syntax: XML
  1. Element Type Declarations
  2. Attribute List Declarations
  3. Entity Declarations
  4. DTD Files

Defining XHTML’s Abstract Syntax: XML
  • HTML is based on SGML (Standard Generalized Markup Language).
  • XHTML is derived from HTML and conforms to the XML standard.
  • XHTML (EXtensible HyperText Markup Language) is a stricter version of HTML: it does not let the author get away with lapses in coding and structure.
  • XML (eXtensible Markup Language) is a markup language much like HTML. It was designed to store and transport data and to be self-descriptive.

XML
  • The abstract syntax of a version of XHTML is defined by a set of text files known collectively as an XML Document Type Definition (DTD).
  • Basic elements of a DTD:
    • Element type declaration
    • Attribute list declaration
    • Entity declaration
 
  • Consider a simple excerpt from the XHTML DTD:
<!ELEMENT html (head, body)>
<!ATTLIST html
lang NMTOKEN #IMPLIED
xml:lang NMTOKEN #IMPLIED
dir (ltr|rtl) #IMPLIED
id ID #IMPLIED
xmlns CDATA #FIXED 'http://www.w3.org/1999/xhtml'>
<!ENTITY gt "&#62;">

Element Type Declaration:
<!ELEMENT html (head, body)>
  • Element type declarations specify the set of all valid elements in the language defined by the DTD.
  • The XHTML DTD contains exactly one element type declaration for each element in the language.
  • The string immediately following ELEMENT is the name of the element type being declared, in this case html.
 
  • The information following the element type name is known as the content specification for the element.
    • It describes the valid content of the element type being declared.
    • In the example, the html element must have exactly two children: a head element followed by a body element.
 
  • Several basic XML content specifications are shown in Table 2.6.

  • Example: the br element is declared EMPTY. The ANY declaration for p is shown only to illustrate the keyword; the actual XHTML DTD restricts p to inline content.
<!ELEMENT br EMPTY>
<!ELEMENT p ANY>
  • The keyword #PCDATA (“Parsed Character DATA”), used in defining the character data and mixed content types, represents any string of characters excluding less-than (<) and ampersand (&), which are excluded because they are the start characters for markup.
  • More sophisticated specifications may be formed by appending one of the iterator characters of Table 2.7 (?, *, or +) to the basic sequence and choice content specification types.
  • Example:
<!ELEMENT select (optgroup|option)+>
  • This is a choice specification with the + iterator character.
    • A select element may contain any number of optgroup and option elements in any order, as long as at least one of them appears.
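  • As a minimal sketch of the select rule above, the following document carries its own DTD in an internal subset. The optgroup and option declarations are simplified assumptions made for illustration, not the real XHTML definitions.

```xml
<?xml version="1.0"?>
<!DOCTYPE select [
  <!ELEMENT select (optgroup|option)+>
  <!-- Simplified, assumed declarations for the two child element types -->
  <!ELEMENT optgroup (option)+>
  <!ELEMENT option (#PCDATA)>
]>
<select>
  <option>Any size</option>
  <optgroup>
    <option>Small</option>
    <option>Large</option>
  </optgroup>
  <option>Custom</option>
</select>
```

  • An empty <select></select> would be invalid under this DTD, because + requires at least one optgroup or option child.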
 
  • Sequence and choice specifications can be nested, and an element name within a sequence or choice may have an iterator character suffixed to it.
  • Example:
<!ELEMENT table (caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))>
  • A table may optionally begin with a caption, followed optionally by a sequence of col elements or a sequence of colgroup elements (but not a mix of the two), followed optionally by a thead and then optionally by a tfoot, and finally either one or more tbody elements or one or more tr elements.
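  • The table rule above can be tried out with a small self-contained document. This is only a sketch: the declarations for caption, col, colgroup, thead, tfoot, tbody, tr, and td are simplified assumptions, far looser than the real XHTML DTD.

```xml
<?xml version="1.0"?>
<!DOCTYPE table [
  <!ELEMENT table (caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))>
  <!-- Simplified, assumed declarations for the child element types -->
  <!ELEMENT caption (#PCDATA)>
  <!ELEMENT col EMPTY>
  <!ELEMENT colgroup (col*)>
  <!ELEMENT thead (tr+)>
  <!ELEMENT tfoot (tr+)>
  <!ELEMENT tbody (tr+)>
  <!ELEMENT tr (td+)>
  <!ELEMENT td (#PCDATA)>
]>
<table>
  <caption>Scores</caption>
  <col/>
  <col/>
  <tr><td>Ann</td><td>91</td></tr>
  <tr><td>Raj</td><td>88</td></tr>
</table>
```

  • This instance uses the caption, the col* branch, and the tr+ branch. Adding a tbody next to the tr elements would make it invalid, since the final group is a choice between tbody+ and tr+.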

Attribute List Declaration:
<!ATTLIST html
lang NMTOKEN #IMPLIED
xml:lang NMTOKEN #IMPLIED
dir (ltr|rtl) #IMPLIED
id ID #IMPLIED
xmlns CDATA #FIXED 'http://www.w3.org/1999/xhtml'>
 
  • An attribute list declaration is included in the DTD for each element that has attributes.
  • An attribute list declaration
    • begins with the keyword ATTLIST,
    • followed by an element type name, and
    • then, for each attribute, three values:
      • the attribute name: taken together, these names are all the valid attributes of the named element,
      • the attribute type: specifies the kind of data that may be used as the attribute value and, for enumerated types, the valid set of values,
      • the default declaration.
 
  • In the example, the html element has five attributes: lang, xml:lang, dir, id, and xmlns.
 
 
  • Table 2.8 gives the attribute types used in the definition of XHTML

  • NMTOKEN (name token): a string of characters representing a name (“word”).
    • The ASCII characters that can be used in an NMTOKEN are letters, digits, and the four characters period (.), hyphen (-), underscore (_), and colon (:).
  • Enumerated attribute type: the allowable values are separated by OR (|) symbols. The attribute can be assigned only one of the specified values.
  • ID attribute type: supplies an identifying name for its element. The value must begin with a letter, underscore, or colon, and must be unique within the document.
  • IDREF attribute type (an id reference): the value of the attribute must be identical to the value of the ID attribute of some element of the document. It is used for linking one element with another.
  • IDREFS attribute type: similar to IDREF, except that it allows a white-space-separated list of id values rather than the single id value allowed by IDREF.
  • CDATA attribute type: represents any string of characters that excludes the literal less-than (<) and ampersand (&) characters and the quote character (" or ') used to delimit the value; these can still be written as references such as &lt; and &amp;.
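  • A minimal sketch of several attribute types working together is given below. The element declarations are simplified assumptions for illustration; only the attribute types themselves follow the descriptions above.

```xml
<?xml version="1.0"?>
<!DOCTYPE form [
  <!-- Simplified, assumed content models -->
  <!ELEMENT form (label, input)+>
  <!ELEMENT label (#PCDATA)>
  <!ELEMENT input EMPTY>
  <!ATTLIST label
    for   IDREF      #IMPLIED>
  <!ATTLIST input
    id    ID         #REQUIRED
    name  NMTOKEN    #IMPLIED
    dir   (ltr|rtl)  #IMPLIED
    title CDATA      #IMPLIED>
]>
<form>
  <label for="email">Email</label>
  <input id="email" name="user.email" dir="ltr" title="Work &amp; home"/>
</form>
```

  • The IDREF value email is valid because some element in the document has an ID attribute with that value. A second input with id="email" would violate the uniqueness rule for ID, and dir="up" would be rejected because it is not one of the enumerated values.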
 
 
 
  • The default declaration for an attribute specifies what value should be used
    • If no value is specified for the attribute in an element of the document or
    • If a value is assigned but does not conform to the attribute’s type.
 
  • The default declaration for an attribute takes one of the forms shown in Table 2.9.
  • #IMPLIED
    • The attribute need not be assigned a value in the start tag for the element, and the DTD does not define a default value for it.
    • The application reading the XML document (e.g., a browser) may assign a default value of its choice to the attribute.
  • #FIXED
    • The DTD supplies the default value of the attribute, and the document is not allowed to override it.
  • Default provided by DTD:
    • The DTD itself can also supply a default value for an attribute, which the document can override.
    • Example:
<!ATTLIST form
method (get|post) "get"
>
  • #REQUIRED
    • a value must be specified for the corresponding attribute whenever the element containing that attribute appears in a valid document.
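  • The four kinds of default declaration can be seen side by side in the sketch below. The form element is declared EMPTY and given a version attribute purely for illustration; both are assumptions, not part of the XHTML DTD.

```xml
<?xml version="1.0"?>
<!DOCTYPE form [
  <!ELEMENT form EMPTY>
  <!ATTLIST form
    action  CDATA      #REQUIRED
    method  (get|post) "get"
    id      ID         #IMPLIED
    version CDATA      #FIXED "1.0">
]>
<!-- action must be present; method, id, and version may be omitted -->
<form action="/search"/>
```

  • A processor that reads this DTD reports method="get" and version="1.0" for the form element even though the start tag omits them, and reports no value for id. Writing version="2.0" in the document would be invalid, and leaving out action would be invalid as well.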
 
Entity declaration
  • An entity declaration begins with the keyword ENTITY, followed by an entity name and its replacement text.
  • An entity declaration is essentially a macro definition.
  • In the example, the declaration associates the name gt (an entity) with the string &#62;, the character reference for the greater-than sign (>).
  • An application reading a document containing an entity reference simply replaces the reference with the string represented by the entity, and then recursively processes this string.
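  • The macro-like, recursive replacement described above can be sketched as follows. The entities course and banner are hypothetical names introduced only for this example.

```xml
<?xml version="1.0"?>
<!DOCTYPE p [
  <!ELEMENT p (#PCDATA)>
  <!ENTITY gt "&#62;">
  <!-- Hypothetical entities: banner refers to course -->
  <!ENTITY course "Web Technologies">
  <!ENTITY banner "&course; notes">
]>
<p>&banner;: 5 &gt; 3</p>
```

  • Replacing &banner; yields the string "&course; notes", which is processed again so that &course; is replaced in turn. The application therefore sees the content of p as: Web Technologies notes: 5 > 3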
 
  • XML also provides a different type of entity that can be referenced only from within DTDs, not from documents. Such entities are called parameter entities.
  • A parameter entity declaration is indicated in the DTD by following the ENTITY keyword with a percent sign (%).
  • Example:
<!ENTITY % URI "CDATA">
  • Using this parameter entity, the XHTML attribute list declaration for the html element is
<!ATTLIST html
lang NMTOKEN #IMPLIED
xml:lang NMTOKEN #IMPLIED
dir (ltr|rtl) #IMPLIED
id ID #IMPLIED
xmlns %URI; #FIXED 'http://www.w3.org/1999/xhtml'>
  • This is equivalent to the version of this declaration given earlier, because the reference %URI; is replaced by CDATA.
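  • The benefit of a parameter entity shows when the same type is reused across several declarations. The fragment below is a sketch of a separate DTD file (the file name links.dtd and the simplified element declarations are assumptions). It is written as an external file because XML 1.0 does not allow a parameter entity reference inside a declaration when that declaration sits in a document's internal subset.

```xml
<!-- links.dtd: hypothetical external DTD file -->
<!ENTITY % URI "CDATA">

<!ELEMENT a (#PCDATA)>
<!ATTLIST a
  href %URI; #IMPLIED>

<!ELEMENT img EMPTY>
<!ATTLIST img
  src %URI; #REQUIRED
  alt CDATA #REQUIRED>
```

  • Both href and src end up with type CDATA, but the name URI documents the intent, and the type can be changed in one place.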
 

DTD Files
  • Example Document Type Declaration for XHTML:
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
 
  • The string immediately following the PUBLIC keyword is called the formal public identifier for the DTD.
  • The URL at the end of the declaration is known as the system identifier for the DTD. It gives the location of a copy of the DTD for the document instance that follows the DOCTYPE declaration.
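  • Putting the pieces together, a minimal document instance that uses this document type declaration might look like the sketch below. The title and paragraph text are placeholders.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
  PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  <head>
    <title>Minimal page</title>
  </head>
  <body>
    <p>Hello, XHTML.</p>
  </body>
</html>
```

  • The html element has a head followed by a body, as the element type declaration requires, and its xmlns value matches the #FIXED value in the attribute list declaration.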