Representing Web Data XML

 

  • XML is the most widely used technology for transmitting structured data over the Web.
  • Consider a simple XML document:

<text>

Hello World!

</text>

  • Any document that follows the basic XML syntax rules (listed under "XML Documents and Vocabularies" below) and that has a single root element is a well-formed XML document.
  • So the above “Hello World!” example document is a well-formed XML document.
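
A minimal sketch of checking well-formedness from Java, assuming the JAXP parser that ships with the JDK. The class name WellFormedCheck and the helper isWellFormed are illustrative names, not part of any library. The default DocumentBuilder is nonvalidating, so it only checks the syntax rules and the single-root requirement.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {
    /** Returns true if the JDK's default (nonvalidating) parser accepts the text. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and keeps the parser from printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the document as not well formed
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The "Hello World!" document: one root element, tags properly paired.
        System.out.println(isWellFormed("<text>\n\nHello World!\n\n</text>"));
        // Two top-level elements, so there is no single root.
        System.out.println(isWellFormed("<text>Hello</text><text>World!</text>"));
        // Names are case sensitive, so </Text> does not close <text>.
        System.out.println(isWellFormed("<text>Hello World!</Text>"));
    }
}
```

Under these assumptions the first call should report true and the other two false.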

XML Documents and Vocabularies

  • The 13 basic XML syntax rules are as follows:
    • An XML document consists of markup and character data.
    • There are two types of markup: tags and references.
    • Tags begin with a less-than (<) character and end with a greater-than (>) character.
    • References in an XML document begin with an ampersand (&) and are of two types: character references (such as &#169; for copyright symbol) and entity references (such as &lt; for <).
    • All XML documents may make references to the entities lt, gt, amp, apos, and quot, which map to the characters less than (<), greater than (>), ampersand (&), single quote ('), and double quote ("), respectively. Other entities may also be defined, depending on the XML DTD used and/or the application processing the document.
    • If not used to begin markup, the characters < and & must be escaped.
    • Element tags are of three types: start tags, end tags (which begin with </), and empty element tags (which end with />).
    • Character data may only appear within a nonempty element.
    • Start and end tags must be paired and must be properly nested with other pairs of start and end tags.
    • Attribute specifications may appear within start tags or empty-element tags. Every attribute specification consists of an attribute name followed by an equals sign (=) followed by a quoted attribute value. An attribute value may not contain the character <; if the character & appears, it must be the first character in a character or entity reference. A pair of either single quotes or double quotes may be used to quote an attribute value. Attribute specifications are white-space-separated from one another.
    • Element and attribute names are case sensitive.
    • The XML white space characters are the same as in XHTML: space, carriage return, line feed, and tab.
    • XML comments begin with <!--, end with -->, and may not contain the string -- elsewhere within the content of the comment.
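
Several of the rules above can be seen at work in a small document. The sketch below is only illustrative: the note/body/br vocabulary and the class name SyntaxRulesDemo are made up for this example, and the JDK's built-in JAXP parser is assumed. It shows references being replaced by the characters they stand for, both quoting styles for attribute values, an empty-element tag, a comment, and case-sensitive names.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class SyntaxRulesDemo {
    // Hypothetical vocabulary, used only to exercise the syntax rules.
    static final String XML =
          "<note lang=\"en\" author='A. Writer'>"     // double- and single-quoted attribute values
        + "<!-- comments may not contain two adjacent hyphens -->"
        + "<body>1 &lt; 2 &amp; 3 &gt; 2 &#169;</body>" // entity and character references
        + "<br/>"                                       // empty-element tag
        + "</note>";

    static Element parseRoot(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Character data of the body element, with references already replaced by the parser. */
    static String bodyText(Element root) {
        return root.getElementsByTagName("body").item(0).getTextContent();
    }

    public static void main(String[] args) {
        Element root = parseRoot(XML);
        System.out.println(bodyText(root));               // escaped < and & come back as characters
        System.out.println(root.getAttribute("author"));  // value of the single-quoted attribute
        System.out.println(root.hasAttribute("lang"));    // attribute is present under this name
        System.out.println(root.hasAttribute("Lang"));    // different case, different name
        System.out.println(root.getElementsByTagName("br").item(0).hasChildNodes());
    }
}
```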

 

  • A Java program that processes an XML document typically has two parts:
    • An XML parser that converts between the XML document and some internal data format, such as a tree structure.
    • The actual application software that processes the data represented by the XML document. This part deals with the semantics of the XML document: element types, attribute names, what they represent, etc.
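
The two-part structure can be sketched as follows, assuming the JDK's DOM parser for the first part. The order/item vocabulary, the price and qty attributes, and the class name OrderTotal are hypothetical, invented for this illustration. The parse method knows nothing about orders; only total knows what the element and attribute names mean.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class OrderTotal {
    // Part 1: the parser turns XML text into an internal tree (a DOM Document).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Part 2: application logic that knows the semantics of this (hypothetical) vocabulary:
    // every item element carries a unit price and a quantity.
    static double total(Document doc) {
        NodeList items = doc.getElementsByTagName("item");
        double sum = 0.0;
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            double price = Double.parseDouble(item.getAttribute("price"));
            int qty = Integer.parseInt(item.getAttribute("qty"));
            sum += price * qty;
        }
        return sum;
    }

    public static void main(String[] args) {
        String xml = "<order>"
                + "<item name=\"pen\" price=\"2.50\" qty=\"2\"/>"
                + "<item name=\"pad\" price=\"1.25\" qty=\"4\"/>"
                + "</order>";
        System.out.println(total(parse(xml))); // 2 * 2.50 + 4 * 1.25
    }
}
```

Keeping the two parts separate means the application code depends only on the tree, not on how the XML text was read.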

 

  • XML parsers are of two types: nonvalidating and validating.
  • Nonvalidating parser:
    • Verifies that an input XML document is well formed.
    • Even if the document has a document type declaration, a nonvalidating parser is not required to read any part of the DTD that is external to the document.
  • Validating parser:
    • Requires that any document it parses contain a document type declaration.
    • Reads the DTD, verifies that the document conforms to the DTD, and also verifies that the document meets the validity constraints defined by the XML 1.0 recommendation.
      • Example of a validity constraint: each value assigned to an attribute of type ID must be distinct from the values assigned to all other such attributes.
    • Advantage of validating parsers: every correct implementation of a validating parser should produce essentially the same results when parsing a given XML document.
      • This means that if you write an XML processing application using one validating parser implementation and later decide to use a different parser, you should be able to substitute the new parser into your application without making any changes to the application code you have written.
    • Nonvalidating parsers may read all, some, or none of a DTD. One implementation might read the entity declarations and default attribute values contained in a DTD while another might not, so different nonvalidating parsers may produce different results when parsing the same XML document.
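
The difference between the two parser types can be sketched with JAXP, where DocumentBuilderFactory.setValidating(true) requests a validating parser. This assumes the parser bundled with the JDK; the class name ValidationDemo, the helper validityErrors, and the list/item vocabulary are illustrative. The sample document is well formed but breaks the ID constraint described above, because two items share the id value a1.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class ValidationDemo {
    // Well formed, but invalid: the ID value "a1" is used twice.
    static final String DUPLICATE_IDS =
          "<!DOCTYPE list [\n"
        + "  <!ELEMENT list (item*)>\n"
        + "  <!ELEMENT item (#PCDATA)>\n"
        + "  <!ATTLIST item id ID #REQUIRED>\n"
        + "]>\n"
        + "<list><item id=\"a1\">first</item><item id=\"a1\">second</item></list>";

    /** Parses xml and returns the messages of the validity errors the parser reported. */
    static List<String> validityErrors(String xml, boolean validating) {
        List<String> errors = new ArrayList<>();
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(validating); // true requests a DTD-validating parser
        try {
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* ignored in this sketch */ }
                // Validity violations arrive here as recoverable errors.
                @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
                // Well-formedness violations are fatal; pass them on to the caller.
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            throw new IllegalArgumentException("document is not well formed", e);
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        // Nonvalidating: only well-formedness is checked, so nothing is reported.
        System.out.println(validityErrors(DUPLICATE_IDS, false));
        // Validating: the duplicate ID is reported (message wording is parser specific).
        System.out.println(validityErrors(DUPLICATE_IDS, true));
        // Validating, but the document has no document type declaration.
        System.out.println(validityErrors("<text>Hello World!</text>", true));
    }
}
```

The exact error messages depend on the parser implementation, so an application should not rely on their wording.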

 

  • Nonvalidating parsers have the advantage that they generally run faster than validating parsers, because they do less checking of the document.

 

  • If either a validating or a nonvalidating parser detects an error while parsing a document, the parser generally signals the error to the application program that called the parser API. Handling of the error depends on the application.
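
With the SAX API, for example, the parser signals a well-formedness error by throwing a SAXParseException to the caller, which then decides what to do. The sketch below assumes the JDK's SAX parser. The class name ParseErrorReport and the helper firstError are illustrative, and turning the error into a one-line report is just one possible application policy.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Optional;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class ParseErrorReport {
    /** Returns a description of the first fatal parse error, or empty if parsing succeeded. */
    static Optional<String> firstError(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            // DefaultHandler ignores content events and rethrows fatal errors.
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return Optional.empty();
        } catch (SAXParseException e) {
            // The exception carries the location at which the parser gave up.
            return Optional.of("line " + e.getLineNumber()
                    + ", column " + e.getColumnNumber() + ": " + e.getMessage());
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Application policy in this sketch: print a report, or confirm success.
        System.out.println(firstError("<text>\nHello World!\n</txt>").orElse("parsed without errors"));
        System.out.println(firstError("<text>Hello World!</text>").orElse("parsed without errors"));
    }
}
```

Another application might instead log the error and skip the document, or abort the whole run; the parser leaves that choice to the caller.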