Representing Web Data: XML

XML is one of the most widely used technologies for transmitting structured data over the Web.
Consider a simple XML document:

<text>
  Hello World!
</text>

Any document that follows the basic XML syntax rules listed in the next section and that has a single root element is a well-formed XML document.
The "Hello World!" document above is therefore a well-formed XML document: it obeys those rules, and its single root element is text.
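Well-formedness can be checked mechanically by handing a document to a parser. The following is a minimal sketch, assuming the JAXP DOM parser bundled with the JDK (javax.xml.parsers); the class name WellFormedCheck and the method isWellFormed are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of any library.
class WellFormedCheck {

    // Returns true if the parser accepts the text as a well-formed XML document.
    static boolean isWellFormed(String xml) {
        try {
            // A factory produces nonvalidating parsers unless setValidating(true) is called.
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Well-formedness violations are reported as a SAXException.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            // Not expected for in-memory input; treated as a programming error here.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<text>Hello World!</text>")); // true
        System.out.println(isWellFormed("<text>Hello World!"));        // false: missing end tag
        System.out.println(isWellFormed("<a/><b/>"));                  // false: two root elements
    }
}
```

Because no error handler is registered in this sketch, the parser may also print its own diagnostic to standard error for the rejected inputs.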

XML Documents and Vocabularies

The basic XML syntax rules (13 in all) are as follows:
- An XML document consists of markup and character data.
- There are two types of markup: tags and references.
- Tags begin with a less-than (<) character and end with a greater-than (>) character.
- References in an XML document begin with an ampersand (&) and are of two types: character references (such as &#169; for the copyright symbol ©) and entity references (such as &lt; for <).
- All XML documents may make references to the entities lt, gt, amp, apos, and quot, which map to the characters less than (<), greater than (>), ampersand (&), single quote ('), and double quote ("), respectively. Other entities may also be defined, depending on the XML DTD used and/or the application processing the document.
- If not used to begin markup, the characters < and & must be escaped.
- Element tags are of three types: start tags, end tags (which begin with </), and empty element tags (which end with />).
- Character data may only appear within a nonempty element.
- Start and end tags must be paired and must be properly nested with other pairs of start and end tags.
- Attribute specifications may appear within start tags or empty-element tags. Every attribute specification consists of an attribute name followed by an equals sign (=) followed by a quoted attribute value. An attribute value may not contain the character <; if the character & appears, it must be the first character in a character or entity reference. A pair of either single quotes or double quotes may be used to quote an attribute value. Attribute specifications are white-space-separated from one another.
- Element and attribute names are case sensitive.
- The XML white space characters are the same as in XHTML: space, carriage return, line feed, and tab.
- XML comments begin with <!--, end with -->, and may not contain the string -- elsewhere within the content of the comment.
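Several of these rules can be seen at work in one small document. The sketch below assumes the JDK's built-in JAXP DOM parser; the sample document, the class SyntaxRulesDemo, and the helpers parseRoot and textOf are made up for illustration. It shows references being resolved to characters, both quoting styles for attribute values, an empty-element tag, and the case sensitivity of element names.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; the sample vocabulary (note, msg, br) is invented.
class SyntaxRulesDemo {

    static final String SAMPLE =
          "<!-- a comment with no double hyphen inside -->\n"
        + "<note lang='en' id=\"n1\">\n"                     // single or double quotes around values
        + "  <msg>a &lt; b &amp;&amp; c &#169;</msg>\n"      // escaped < and &, one character reference
        + "  <br/>\n"                                        // empty-element tag
        + "</note>";

    // Parses the text and returns the root element of the resulting tree.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Text content of the first descendant element with this exact name, or null if none exists.
    static String textOf(Element root, String name) {
        NodeList nodes = root.getElementsByTagName(name);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
    }

    public static void main(String[] args) {
        Element root = parseRoot(SAMPLE);
        System.out.println(root.getAttribute("lang")); // en
        System.out.println(textOf(root, "msg"));       // references arrive as plain characters
        System.out.println(textOf(root, "Msg"));       // null: names are case sensitive
    }
}
```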

A Java program that processes an XML document typically has two parts.
- An XML parser, which converts between the XML document and some internal data format, such as a tree structure.
- The actual application software, which processes the data represented by the XML document. This part deals with the semantics of the XML document: element types, attribute names, what they represent, and so on.
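The two-part structure can be sketched as follows, again assuming the JDK's JAXP DOM parser. The order vocabulary (order and item elements with priceCents and qty attributes), the class OrderTotal, and its methods are hypothetical. The parse method is the parser half and knows nothing about orders; totalCents is the application half and is the only place that attaches meaning to element and attribute names.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical application; the "order" vocabulary is invented for this sketch.
class OrderTotal {

    // Part 1: the parser converts XML text into an internal tree (a DOM Document).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Part 2: the application interprets the tree according to the vocabulary's semantics.
    static int totalCents(Document doc) {
        NodeList items = doc.getElementsByTagName("item");
        int total = 0;
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            int price = Integer.parseInt(item.getAttribute("priceCents"));
            int qty = Integer.parseInt(item.getAttribute("qty"));
            total += price * qty;
        }
        return total;
    }

    public static void main(String[] args) {
        String xml = "<order>"
                + "<item priceCents=\"250\" qty=\"2\"/>"
                + "<item priceCents=\"100\" qty=\"3\"/>"
                + "</order>";
        System.out.println(totalCents(parse(xml))); // 250*2 + 100*3 = 800
    }
}
```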

XML parsers are of two types: nonvalidating and validating.
Nonvalidating parser:
- Verifies that an input XML document is well formed.
- Even if a document type declaration is present in the XML document, a nonvalidating parser is not required to read a DTD that is external to the document.
Validating parser:
- Requires that any document it parses contain a document type declaration.
- Reads the DTD, verifies that the document conforms to the DTD, and also verifies that the document meets the validity constraints defined by the XML 1.0 recommendation.
  - Example of a validity constraint: each value assigned to an attribute of type ID must be distinct from the values assigned to all other such attributes.
- Advantage: every correct implementation of a validating parser should produce essentially the same results when parsing a given XML document.
  - This means that if you write an XML processing application using one validating parser implementation and later decide to use a different parser, you should be able to substitute the new parser into your application without making any changes to the application code you have written.
- Nonvalidating parsers, by contrast, may read all, some, or none of a DTD. One implementation might read the entity declarations and default attribute values contained in a DTD while another might not, so different nonvalidating parsers may produce different results when parsing the same XML document.
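The difference between the two parser types can be observed by parsing the same document twice. The sketch below assumes the JDK's JAXP implementation, where DocumentBuilderFactory.setValidating(true) requests a validating parser and validity errors are delivered to the registered ErrorHandler's error method. The class ValidationDemo, the sample DTD, and the method validityErrors are hypothetical. The sample document is well formed but assigns the same ID value twice, which violates the ID uniqueness constraint.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical demo class; the "list" vocabulary and its DTD are invented.
class ValidationDemo {

    // Well formed, but both entry elements use the ID value "a1".
    static final String DUPLICATE_IDS =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE list [\n"
        + "  <!ELEMENT list (entry*)>\n"
        + "  <!ELEMENT entry (#PCDATA)>\n"
        + "  <!ATTLIST entry id ID #REQUIRED>\n"
        + "]>\n"
        + "<list><entry id=\"a1\">one</entry><entry id=\"a1\">two</entry></list>";

    // Parses the text and returns the messages of all recoverable (validity) errors reported.
    static List<String> validityErrors(String xml, boolean validating) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(validating);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* ignored in this sketch */ }
                @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // A fatal (well-formedness) error is not expected for the samples used here.
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        // Validating: the duplicate ID should be reported as a validity error.
        System.out.println(validityErrors(DUPLICATE_IDS, true));
        // Nonvalidating: the document is well formed, so no error is expected.
        System.out.println(validityErrors(DUPLICATE_IDS, false));
        // Validating a document with no document type declaration should also report errors.
        System.out.println(validityErrors("<text>Hello World!</text>", true));
    }
}
```

The exact wording of the error messages depends on the parser implementation, so the sketch collects them rather than matching on their text.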

Nonvalidating parsers have the advantage that they will generally run faster than validating parsers because they perform less validation.

If either a validating or a nonvalidating parser detects an error while parsing a document, the parser generally signals the error to the application program that called the parser API. Handling of the error depends on the application.
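As one illustration of error signaling, the sketch below assumes the JDK's JAXP DOM parser, which reports problems to the application as SAXParseException objects carrying a message and a line number. The class ParseErrorReport and the method describe are hypothetical, and turning the error into a one-line report is only one possible application policy; another application might retry, fall back to a default, or abort.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical demo class showing one way an application might react to parser errors.
class ParseErrorReport {

    // Returns "ok" if parsing succeeds, otherwise a one-line description of the first error.
    static String describe(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // The parser calls this handler; rethrowing hands the error back to the caller of parse().
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* ignored in this sketch */ }
                @Override public void error(SAXParseException e) throws SAXException { throw e; }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return "ok";
        } catch (SAXParseException e) {
            // Application-level decision: report where the problem was found.
            return "parse error at line " + e.getLineNumber() + ": " + e.getMessage();
        } catch (SAXException | ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("<text>Hello World!</text>")); // ok
        System.out.println(describe("<a>\n<b>\n</a>"));            // improperly nested tags
    }
}
```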