Representing Web Data: XML
- XML is the most widely used technology for transmitting structured data over the Web.
- Consider a simple XML document:
<text>
Hello World!
</text>
- Any document that follows the basic XML syntax rules (listed in the next section) and that has a single root element is a well-formed XML document.
- So the “Hello World!” document above is a well-formed XML document.
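Well-formedness can be checked mechanically. The following is a minimal sketch, assuming the JAXP parser bundled with the JDK; the class and method names (WellFormedCheck, isWellFormed) are made up for illustration. It treats any fatal parse error as "not well formed".

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether a string parses as well-formed XML.
class WellFormedCheck {
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and keeps the parser
            // from printing its own messages to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The parser found a violation of the syntax rules.
            return false;
        } catch (Exception e) {
            // Configuration or I/O problems are not well-formedness results.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The "Hello World!" document: one root element, properly closed.
        System.out.println(isWellFormed("<text>\nHello World!\n</text>"));
        // Two top-level elements, so there is no single root.
        System.out.println(isWellFormed("<text>Hello</text><text>World</text>"));
    }
}
```

The first call should report true and the second false, since the second input has two root elements.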
XML Documents and Vocabularies
- The 13 basic XML syntax rules are as follows:
- An XML document consists of markup and character data.
- There are two types of markup: tags and references.
- Tags begin with a less-than (<) character and end with a greater-than (>) character.
- References in an XML document begin with an ampersand (&) and are of two types: character references (such as &#169; for the copyright symbol ©) and entity references (such as &lt; for <).
- All XML documents may make references to the entities lt, gt, amp, apos, and quot, which map to the characters less than (<), greater than (>), ampersand (&), single quote ('), and double quote ("), respectively. Other entities may also be defined, depending on the XML DTD used and/or the application processing the document.
- If not used to begin markup, the characters < and & must be escaped.
- Element tags are of three types: start tags, end tags (which begin with </), and empty element tags (which end with />).
- Character data may only appear within a nonempty element.
- Start and end tags must be paired and must be properly nested with other pairs of start and end tags.
- Attribute specifications may appear within start tags or empty-element tags. Every attribute specification consists of an attribute name followed by an equals sign (=) followed by a quoted attribute value. An attribute value may not contain the character <; if the character & appears, it must be the first character in a character or entity reference. A pair of either single quotes or double quotes may be used to quote an attribute value. Attribute specifications are white-space-separated from one another.
- Element and attribute names are case sensitive.
- The XML white space characters are the same as in XHTML: space, carriage return, line feed, and tab.
- XML comments begin with <!--, end with -->, and may not contain the string -- elsewhere within the content of the comment.
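The escaping rules above can be sketched in code. This is an illustrative helper, not a library API: the names XmlEscapeDemo, escape, and roundTrip, and the note attribute, are assumptions. It replaces the five characters that have predefined entities, then parses the result with the JDK's JAXP parser to confirm that the original text comes back.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper showing the five predefined entity references.
class XmlEscapeDemo {
    // Replaces <, >, &, ' and " with lt, gt, amp, apos and quot references.
    static String escape(String raw) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '&': out.append("&amp;"); break;
                case '\'': out.append("&apos;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Puts the escaped text in an attribute value and in character data,
    // parses the document, and returns the character data the parser saw.
    static String roundTrip(String raw) {
        String escaped = escape(raw);
        String xml = "<text note=\"" + escaped + "\">" + escaped + "</text>";
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(escape("Tom & Jerry <3"));    // Tom &amp; Jerry &lt;3
        System.out.println(roundTrip("Tom & Jerry <3")); // Tom & Jerry <3
    }
}
```

Strictly, only < and & must always be escaped; escaping the quote characters as well keeps the same helper safe for attribute values quoted with either kind of quote.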
- A Java program that processes an XML document typically has two parts:
- An XML parser, which converts between the XML document and some internal data format, such as a tree structure.
- The actual application software, which processes the data represented by the XML document. This part deals with the semantics of the XML document: element types, attribute names, what they represent, and so on.
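The two-part structure can be sketched as follows, assuming the JDK's DOM parser as the tree-building part. The greetings vocabulary (a greetings root holding text elements with a lang attribute) and the names TwoPartProgram, parse, and describeGreetings are invented for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TwoPartProgram {
    // Part 1: the parser turns XML text into an internal tree (a DOM Document).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Part 2: application code that knows the (hypothetical) vocabulary:
    // each text element carries a lang attribute and a greeting.
    static List<String> describeGreetings(Document doc) {
        List<String> result = new ArrayList<>();
        NodeList items = doc.getElementsByTagName("text");
        for (int i = 0; i < items.getLength(); i++) {
            Element e = (Element) items.item(i);
            result.add(e.getAttribute("lang") + ": " + e.getTextContent().trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<greetings>"
                + "<text lang=\"en\">Hello World!</text>"
                + "<text lang=\"fr\">Bonjour le monde!</text>"
                + "</greetings>";
        for (String line : describeGreetings(parse(xml))) {
            System.out.println(line);
        }
    }
}
```

Only describeGreetings depends on what the element and attribute names mean; parse would work unchanged for any well-formed document.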
- XML parsers are of two types: nonvalidating and validating.
- Nonvalidating parser:
- Verifies that an input XML document is well formed.
- Even if a DTD is present in the XML document, a nonvalidating parser is not required to read a DTD that is external to the document.
- Validating parser:
- Requires that any document it parses contain a document type declaration.
- Reads the DTD, verifies that the document conforms to the DTD, and also verifies that the document meets the validity constraints defined by the XML 1.0 recommendation.
- Example of a validity constraint: each value assigned to an attribute of type ID must be distinct from the values assigned to all other such attributes.
- Advantage of validating parsers: every correct implementation of a validating parser should produce essentially the same results when parsing a given XML document.
- This means that if you write an XML processing application using one validating parser implementation and later decide to use a different parser, you should be able to substitute the new parser into your application without making any changes to the application code you have written.
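A validating parse can be sketched with the JDK's JAXP API by calling setValidating(true) on the factory. The class ValidatingParseDemo, the method validate, and the list/item/key vocabulary are assumptions made for illustration. Validity errors are recoverable, so the sketch collects them through an ErrorHandler instead of stopping at the first one.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class ValidatingParseDemo {
    // Hypothetical DTD in the internal subset: key has attribute type ID.
    static final String DTD = "<!DOCTYPE list ["
            + "<!ELEMENT list (item*)>"
            + "<!ELEMENT item (#PCDATA)>"
            + "<!ATTLIST item key ID #REQUIRED>"
            + "]>";

    // Returns the validity and well-formedness problems the parser reports.
    static List<String> validate(String xml) {
        List<String> problems = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // ask for DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) {
                    problems.add(e.getMessage()); // validity error, parsing continues
                }
                public void fatalError(SAXParseException e) throws SAXException {
                    throw e; // not well formed, parsing stops
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            problems.add("fatal: " + e.getMessage());
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return problems;
    }

    public static void main(String[] args) {
        String ok = DTD + "<list><item key=\"a1\">one</item><item key=\"a2\">two</item></list>";
        String duplicateId = DTD + "<list><item key=\"a1\">one</item><item key=\"a1\">two</item></list>";
        System.out.println(validate(ok));          // expected to be empty
        System.out.println(validate(duplicateId)); // expected to report the repeated ID
    }
}
```

Passing a document with no document type declaration to validate should also produce errors, which matches the requirement that a validating parser be given one.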
- Nonvalidating parsers may read all, some, or none of a DTD. One implementation might read the entity declarations and default attribute values contained in a DTD while another might not, so different nonvalidating parsers may produce different results when parsing the same XML document.
- Nonvalidating parsers have the advantage that they will generally run faster than validating parsers because they perform less validation.
- If either a validating or a nonvalidating parser detects an error while parsing a document, the parser generally signals the error to the application program that called the parser API. Handling of the error depends on the application.
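As an illustration of error signaling, the sketch below assumes the JDK's JAXP parser, which reports a fatal error to the caller as a SAXParseException carrying the error location. The class ParseErrorReport and the method tryParse are hypothetical; here the application simply chooses to turn the exception into a short status string.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class ParseErrorReport {
    // Parses the text and returns "ok" or a message locating the error.
    static String tryParse(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return "ok";
        } catch (SAXParseException e) {
            // The parser signals the error; what happens next is up to the
            // application. This one reports the line and gives up.
            return "error at line " + e.getLineNumber() + ": " + e.getMessage();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(tryParse("<text>\nHello World!\n</text>"));
        // The end tag does not match the start tag.
        System.out.println(tryParse("<text>\nHello World!\n</txt>"));
    }
}
```

Another application might instead log the error and skip the document, or fall back to default data; the parser API leaves that decision to the caller.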