DOM-Based XML Processing

XML Processing

There are several standard approaches to processing XML documents in Java:
- approaches based on the Document Object Model (DOM)
- Java-based XML packages.

Figure 7.1 is an XML document that represents an RSS (Rich Site Summary) feed, a document designed to provide summary information about a Web site.

It is used as the input file when explaining the parsing methods.

<!DOCTYPE rss
  SYSTEM "http://my.netscape.com/publish/formats/rss-0.91.dtd">
<rss version="0.91">
  <channel>
    <title>www.example.com</title>
    <description>
      www.example.com is not a site that changes often…
    </description>
    <item>
      <title>Announcing a Sibling Site!</title>
      <description>
        Were you aware that example.com is not the only site in the example family?
      </description>
    </item>
    <item>
      <description>
        Our new RSS feed is up. Visit us today!
      </description>
    </item>
  </channel>
</rss>

DOM-Based XML Processing

In DOM-based XML processing, an XML document is first input and parsed, creating a tree of nodes representing elements, text, comments, and so on.
After the tree has been constructed, DOM methods can be called to modify the tree, extract data from it, and so on.
The Java DOM API is defined as part of the standard Java API and specifies a number of interfaces that correspond to DOM objects and classes, such as Node, Document, Element, and Text.
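
To make the tree model concrete, here is a minimal sketch that parses a small XML string, walks the resulting tree recursively to count nodes by type, and then modifies the tree. The class name DomTreeWalk, its helper methods, and the sample XML are illustrative assumptions and do not come from the text; only standard JAXP and org.w3c.dom calls are used.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Illustrative helper (not from the text): parse XML and walk the DOM tree. */
class DomTreeWalk
{
    /** Parse XML text into a DOM Document; parse failures become unchecked exceptions. */
    static Document parse(String xml)
    {
        try
        {
            DocumentBuilder parser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        }
        catch (Exception e)
        {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Count the nodes of the given type in the subtree rooted at node. */
    static int countNodes(Node node, short nodeType)
    {
        int count = (node.getNodeType() == nodeType) ? 1 : 0;
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++)
        {
            count += countNodes(children.item(i), nodeType);
        }
        return count;
    }

    public static void main(String[] args)
    {
        // Made-up sample input: one comment, three elements, two text nodes
        String xml = "<item><!-- sample --><title>Hello</title>"
                   + "<description>World</description></item>";
        Document document = parse(xml);

        // Extract data from the tree
        Element root = document.getDocumentElement();
        System.out.println("Root element: " + root.getTagName());
        System.out.println("Elements: " + countNodes(document, Node.ELEMENT_NODE));
        System.out.println("Text nodes: " + countNodes(document, Node.TEXT_NODE));
        System.out.println("Comments: " + countNodes(document, Node.COMMENT_NODE));

        // Modify the tree: add an attribute to the root element
        root.setAttribute("reviewed", "true");
        System.out.println("reviewed = " + root.getAttribute("reviewed"));
    }
}
```

The recursion relies only on the Node interface, so the same method handles elements, text, and comments uniformly.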

Consider the Java program of Figure 7.7. This program reads an XML document from a user-specified file and outputs the number of link elements contained in that document.

// JAXP classes
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.DocumentBuilder;

// DOM classes
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// JDK classes
import java.io.File;

/** Count the number of link elements in an XML document */
class DOMCountLinks
{
    /** Main program does it all */
    public static void main(String[] args)
    {
        try
        {
            // JAXP-style initialization of DocumentBuilder
            // (XML parser that builds DOM from document)
            DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder parser = docBuilderFactory.newDocumentBuilder();

            // Parse XML document from given file into a DOM Document object
            Document document = parser.parse(new File(args[0]));

            // Process the Document object using the Java API version of the W3C DOM
            NodeList links = document.getElementsByTagName("link");
            System.out.println("Input document has " + links.getLength() + " 'link' elements.");
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}

FIGURE 7.7 DOM-based program for displaying the number of link elements in an XML document.

The program:
- opens the file specified by the first command-line argument
- parses the XML document contained in this file
- produces a Document object
- uses the getElementsByTagName() method to retrieve a list of Node objects corresponding to elements of type link in the XML document
- counts and outputs the number of link elements in the XML document.
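
The steps listed above can be extended slightly to show what the returned NodeList holds. The following sketch, under stated assumptions, parses from a string instead of a file and collects the text of each link element. The class DOMListLinks, the method linkTexts(), and the sample document with its URLs are made up for illustration and are not part of Figure 7.7.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Illustrative variant of the link counter: list the text of each link element. */
class DOMListLinks
{
    /** Return the trimmed text content of every link element, in document order. */
    static List<String> linkTexts(String xml)
    {
        try
        {
            DocumentBuilder parser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document document = parser.parse(new InputSource(new StringReader(xml)));

            // Each entry of the NodeList is a Node for one link element
            NodeList links = document.getElementsByTagName("link");
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < links.getLength(); i++)
            {
                texts.add(links.item(i).getTextContent().trim());
            }
            return texts;
        }
        catch (Exception e)
        {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args)
    {
        // Hypothetical sample input; the URLs are placeholders
        String xml = "<rss><channel>"
                   + "<link>http://www.example.com/</link>"
                   + "<item><link>http://www.example.org/</link></item>"
                   + "</channel></rss>";
        List<String> texts = linkTexts(xml);
        System.out.println("Input document has " + texts.size() + " 'link' elements.");
        for (String text : texts)
        {
            System.out.println("  " + text);
        }
    }
}
```

Note that getElementsByTagName() searches the whole tree, so link elements nested inside item elements are found along with the one directly under channel.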

DocumentBuilderFactory docBF = DocumentBuilderFactory.newInstance();

DocumentBuilder parser = docBF.newDocumentBuilder();

This is the standard technique for obtaining a DOM-based parser in a Java program using the Sun Java API for XML Processing (JAXP).
JAXP provides a unified approach to creating parser instances through a factory mechanism.
(A factory is just an object that is used to create other objects.)
In JAXP, a two-stage approach is used to create a parser.
- (1) The factory itself is created by a call to the static newInstance() method of DocumentBuilderFactory.
- (2) Once the factory instance has been created, it is used to create the actual DOM-based parser by a call to the factory’s newDocumentBuilder() method.

By default, a DocumentBuilderFactory instance creates a parser that is nonvalidating and not namespace-aware. These defaults can be overridden by calling the methods setValidating(true) and setNamespaceAware(true), respectively, on the DocumentBuilderFactory instance before creating the DocumentBuilder instance.
For example, this code creates a parser that is nonvalidating but namespace-aware:

DocumentBuilderFactory docBF = DocumentBuilderFactory.newInstance();

docBF.setNamespaceAware(true);

DocumentBuilder parser = docBF.newDocumentBuilder();
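
One way to check the defaults described above is to ask the parser itself, using the isValidating() and isNamespaceAware() methods of DocumentBuilder. The following is a small sketch; the class ParserSettings and its describe() method are hypothetical names introduced for this example.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

/** Illustrative check of the parser settings produced by a factory. */
class ParserSettings
{
    /** Create a parser from the factory and report its two settings. */
    static String describe(DocumentBuilderFactory factory)
    {
        try
        {
            DocumentBuilder parser = factory.newDocumentBuilder();
            return "validating=" + parser.isValidating()
                 + ", namespaceAware=" + parser.isNamespaceAware();
        }
        catch (ParserConfigurationException e)
        {
            throw new IllegalStateException("Parser could not be configured", e);
        }
    }

    public static void main(String[] args)
    {
        // Factory left at its defaults
        DocumentBuilderFactory defaults = DocumentBuilderFactory.newInstance();
        System.out.println("Default:         " + describe(defaults));

        // Factory configured before the parser is created
        DocumentBuilderFactory docBF = DocumentBuilderFactory.newInstance();
        docBF.setNamespaceAware(true);
        System.out.println("Namespace-aware: " + describe(docBF));
    }
}
```

The settings are fixed when newDocumentBuilder() is called, which is why the factory must be configured first.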

The W3C DOM API includes methods that support namespace-aware processing.
One of these is getElementsByTagNameNS(), which returns a NodeList for a given element type name. It takes two String arguments: a namespace name (URI) followed by a local name.
For example:

NodeList links = document.getElementsByTagNameNS(null, "link");

retrieves a NodeList containing all of the link elements in the document that belong to no namespace.

NodeList links = document.getElementsByTagNameNS("http://www.w3.org/1999/xhtml", "link");

retrieves a NodeList containing all of the link elements in a valid XHTML document that belong to the default namespace http://www.w3.org/1999/xhtml.
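
The two calls above can be compared on a single document. The sketch below assumes a namespace-aware parser (setNamespaceAware(true)) and a made-up input in which one link element has no namespace and two carry the XHTML namespace through the prefix h. The class NamespaceLinks and its countLinks() method are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Illustrative comparison of link counts by namespace. */
class NamespaceLinks
{
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Count link elements in the given namespace (null means no namespace). */
    static int countLinks(String xml, String namespaceUri)
    {
        try
        {
            DocumentBuilderFactory docBF = DocumentBuilderFactory.newInstance();
            docBF.setNamespaceAware(true);   // required for the NS methods to be useful
            DocumentBuilder parser = docBF.newDocumentBuilder();
            Document document = parser.parse(new InputSource(new StringReader(xml)));
            return document.getElementsByTagNameNS(namespaceUri, "link").getLength();
        }
        catch (Exception e)
        {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args)
    {
        // Made-up sample: one unqualified link, two links in the XHTML namespace
        String xml = "<root xmlns:h=\"" + XHTML_NS + "\">"
                   + "<link/><h:link/><h:link/>"
                   + "</root>";
        System.out.println("No namespace:    " + countLinks(xml, null));
        System.out.println("XHTML namespace: " + countLinks(xml, XHTML_NS));
    }
}
```

Matching is done on the namespace name and local name, so the prefix used in the document (h here) plays no part in the lookup.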

Note:
- Although the DocumentBuilder class has only been used to parse existing documents, it also provides a method newDocument() that can be called in order to create an empty Document object.
- DOM API methods, such as createElement() or createElementNS(), can then be called to construct an internal representation of an XML document.
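
As a rough illustration of this note, the following sketch builds a small document from scratch with newDocument(), createElement(), and createTextNode(), then serializes it with the standard javax.xml.transform API. The class DOMBuildDocument, its methods, and the element content are assumptions chosen for the example; the serialization step is an addition that the note itself does not describe.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Illustrative construction of a DOM tree without parsing any input. */
class DOMBuildDocument
{
    /** Build a document consisting of an item element with one title child. */
    static Document build()
    {
        try
        {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document document = builder.newDocument();   // empty Document object

            Element item = document.createElement("item");
            document.appendChild(item);                   // item becomes the root element

            Element title = document.createElement("title");
            title.appendChild(document.createTextNode("Announcing a Sibling Site!"));
            item.appendChild(title);
            return document;
        }
        catch (ParserConfigurationException e)
        {
            throw new IllegalStateException("Parser could not be configured", e);
        }
    }

    /** Serialize the document to a string, omitting the XML declaration. */
    static String toXml(Document document)
    {
        try
        {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(document), new StreamResult(out));
            return out.toString();
        }
        catch (TransformerException e)
        {
            throw new IllegalStateException("Could not serialize document", e);
        }
    }

    public static void main(String[] args)
    {
        System.out.println(toXml(build()));
    }
}
```

Elements are always created through the Document object that will own them, and they become part of the tree only once appendChild() attaches them.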

A drawback of processing an XML document using the DOM approach is that the entire document tree must be created, even if only a fraction of the document is actually pertinent to the software processing it. For a task as simple as counting link elements, loading the entire tree will almost certainly use much more memory than is necessary, and will probably use more CPU time as well (for allocating memory, creating data structures, etc.).