DOM-Based XML Processing

XML Processing

  • There are several standard approaches to processing XML documents in Java:
    • approaches based on the Document Object Model (DOM)
    • approaches based on other Java XML packages

Figure 7.1 is an XML document that represents an RSS (Rich Site Summary) feed, a document designed to provide summary information about a Web site.
It is used as the input file to explain the parsing methods.

<!DOCTYPE rss
  SYSTEM "http://my.netscape.com/publish/formats/rss-0.91.dtd">
<rss version="0.91">
  <channel>
    <title>www.example.com</title>
    <link>http://www.example.com/</link>
    <description>
      www.example.com is not a site that changes often…
    </description>
    <language>en-us</language>
    <item>
      <title>Announcing a Sibling Site!</title>
      <link>http://www.example.org/</link>
      <description>
        Were you aware that example.com is not the only site in the example family?
      </description>
    </item>
    <item>
      <title>We’re Up!</title>
      <link>http://www.example.net/</link>
      <description>
        Our new RSS feed is up. Visit us today!
      </description>
    </item>
  </channel>
</rss>


DOM-Based XML Processing

  • In DOM-based XML processing, an XML document is first input and parsed, creating a tree of nodes representing elements, text, comments, and so on.
  • After the tree has been constructed, DOM methods can be called to modify the tree, extract data from it, and so on.
  • The Java DOM API is defined as part of the standard Java API and specifies a number of interfaces that correspond to DOM objects and classes, such as Node, Document, Element, and Text.
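
To make the idea of a node tree concrete, the following minimal sketch parses a small in-memory document and lists every node it finds, indented by depth. The class name DomTreeWalk, its helper methods, and the sample XML string are assumptions made for illustration; only the DOM and JAXP calls themselves come from the standard Java API.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical helper: lists the nodes of a DOM tree, one per line. */
class DomTreeWalk {

    /** Parses an XML string into a DOM Document (checked exceptions are wrapped for brevity). */
    static Document parse(String xml) {
        try {
            DocumentBuilder parser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return parser.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Appends a description of the node, indented by depth, then recurses into its children. */
    static void describe(Node node, int depth, StringBuilder out) {
        String kind;
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE: kind = "Document"; break;
            case Node.ELEMENT_NODE:  kind = "Element " + node.getNodeName(); break;
            case Node.TEXT_NODE:     kind = "Text \"" + node.getNodeValue().trim() + "\""; break;
            case Node.COMMENT_NODE:  kind = "Comment"; break;
            default:                 kind = "Other (type " + node.getNodeType() + ")";
        }
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(kind).append('\n');
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            describe(child, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        // Sample input is an assumption, loosely modeled on the RSS feed above.
        String xml = "<channel><!-- feed --><title>www.example.com</title></channel>";
        StringBuilder out = new StringBuilder();
        describe(parse(xml), 0, out);
        System.out.print(out);
    }
}
```

The Document node sits at the root of the tree, with element, comment, and text nodes nested beneath it in document order.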

 

  • Consider the Java program of Figure 7.7. This program reads an XML document from a user-specified file and outputs the number of link elements that the document contains.

// JAXP classes
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.DocumentBuilder;

// DOM classes
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// JDK classes
import java.io.File;

/** Count the number of link elements in an XML document */
class DOMCountLinks
{
    /** Main program does it all */
    public static void main(String[] args)
    {
        try
        {
            // JAXP-style initialization of DocumentBuilder
            // (XML parser that builds DOM from document)
            DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder parser = docBuilderFactory.newDocumentBuilder();

            // Parse XML document from given file into a DOM Document object
            Document document = parser.parse(new File(args[0]));

            // Process the Document object using the Java API version of the W3C DOM
            NodeList links = document.getElementsByTagName("link");
            System.out.println("Input document has " + links.getLength() + " 'link' elements.");
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}

FIGURE 7.7 DOM-based program for displaying the number of links in an XML document.

  • The program:
    • opens the file specified by the first command-line argument,
    • parses the XML document contained in this file,
    • produces a Document object,
    • uses the getElementsByTagName() method to retrieve a NodeList of the Node objects corresponding to elements of type link in the XML document, and
    • counts and outputs the number of link elements in the XML document.
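
The NodeList returned by getElementsByTagName() can do more than supply a count. As a minimal sketch, the following code walks the list and collects the text of each link element. The class name LinkTexts, the method linkTexts(), and the abbreviated in-memory feed are assumptions made for illustration and are not part of the program in Figure 7.7.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: collects the text content of every link element. */
class LinkTexts {

    static List<String> linkTexts(String xml) {
        try {
            DocumentBuilder parser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document document = parser.parse(new InputSource(new StringReader(xml)));

            // Same lookup as in Figure 7.7, but each node is inspected instead of only counted.
            NodeList links = document.getElementsByTagName("link");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < links.getLength(); i++) {
                result.add(links.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Abbreviated version of the RSS feed, assumed here for illustration.
        String xml = "<rss version=\"0.91\"><channel>"
                + "<link>http://www.example.com/</link>"
                + "<item><link>http://www.example.org/</link></item>"
                + "</channel></rss>";
        for (String text : linkTexts(xml)) {
            System.out.println(text);
        }
    }
}
```

Note that getElementsByTagName() searches the whole document, so link elements nested inside item elements are returned alongside the one that belongs to the channel itself.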

DocumentBuilderFactory docBF = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = docBF.newDocumentBuilder();

 

  • The code above is the standard technique for obtaining a DOM-based parser in a Java program using the Sun Java API for XML Processing (JAXP).
  • JAXP provides a unified approach to creating parser instances through a factory mechanism.
  • (A factory is just an object that is used to create other objects.)
  • In JAXP, a two-stage approach is used to create a parser.
    • (1) The factory itself is created by a call to the static newInstance() method of DocumentBuilderFactory.
    • (2) Once the factory instance has been created, it is used to create the actual DOM-based parser by a call to the factory’s newDocumentBuilder() method.

 

  • By default, a DocumentBuilderFactory instance creates a parser that is nonvalidating and not namespace-aware. These defaults can be overridden by calling the methods setValidating(true) and setNamespaceAware(true), respectively, on the DocumentBuilderFactory instance before creating the DocumentBuilder instance.
  • For example, the following code creates a parser that is nonvalidating but namespace-aware.

DocumentBuilderFactory docBF = DocumentBuilderFactory.newInstance();

docBF.setNamespaceAware(true);

DocumentBuilder parser = docBF.newDocumentBuilder();
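
The other default can be overridden in the same way. The sketch below enables validation with setValidating(true) and parses two small documents that carry an internal DTD. The class name ValidatingParse, the sample DTD, and the use of an ErrorHandler to collect validity errors are assumptions added for illustration; the exact wording of the error messages depends on the parser implementation.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

/** Hypothetical helper: parses with a validating parser and collects validity errors. */
class ValidatingParse {

    /** Sample internal DTD (an assumption): a note must contain exactly one to element. */
    static final String DTD =
            "<!DOCTYPE note [\n"
            + "  <!ELEMENT note (to)>\n"
            + "  <!ELEMENT to (#PCDATA)>\n"
            + "]>\n";

    static List<String> validityErrors(String xml) {
        final List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory docBF = DocumentBuilderFactory.newInstance();
            docBF.setValidating(true); // must be set before the parser is created
            DocumentBuilder parser = docBF.newDocumentBuilder();

            // Validity violations are reported through error(); parsing then continues.
            parser.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            parser.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println("valid:   " + validityErrors(DTD + "<note><to>reader</to></note>"));
        System.out.println("invalid: " + validityErrors(DTD + "<note><from>writer</from></note>"));
    }
}
```

A validating parser needs a DTD to validate against, which is why the sample documents declare one. The first document should produce an empty list and the second a list with at least one message.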

 

  • The W3C DOM API includes methods that support namespace-aware processing.
  • One of them is getElementsByTagNameNS(), which returns a NodeList of the elements that have a given namespace and local name. It takes two String arguments: a namespace name (URI) followed by a local name.
  • Example

NodeList links = document.getElementsByTagNameNS(null, "link");

  • Retrieves a NodeList containing all of the link elements in the document that belong to no namespace.

NodeList links = document.getElementsByTagNameNS("http://www.w3.org/1999/xhtml", "link");

  • Retrieves a NodeList containing all of the link elements that belong to the XHTML namespace http://www.w3.org/1999/xhtml, which is the default namespace of a valid XHTML document.
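
The two calls above can be tried on small in-memory documents, as in the following sketch. It assumes a namespace-aware parser, because a parser created with the default settings does not record namespace information and getElementsByTagNameNS() then typically finds nothing. The class name NamespaceLinks, the method countLinks(), and the sample documents are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical helper: counts link elements in a given namespace. */
class NamespaceLinks {

    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Counts link elements whose namespace is namespaceUri (null means "no namespace"). */
    static int countLinks(String xml, String namespaceUri) {
        try {
            DocumentBuilderFactory docBF = DocumentBuilderFactory.newInstance();
            docBF.setNamespaceAware(true); // needed for the NS methods to see namespaces
            DocumentBuilder parser = docBF.newDocumentBuilder();
            Document document = parser.parse(new InputSource(new StringReader(xml)));
            return document.getElementsByTagNameNS(namespaceUri, "link").getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Sample documents are assumptions: one XHTML page and one namespace-free feed.
        String xhtml = "<html xmlns=\"" + XHTML_NS + "\"><head>"
                + "<link rel=\"stylesheet\" href=\"a.css\"/></head><body/></html>";
        String rss = "<rss><channel><link>http://www.example.com/</link></channel></rss>";

        System.out.println(countLinks(xhtml, XHTML_NS)); // link is in the XHTML namespace
        System.out.println(countLinks(xhtml, null));     // no link without a namespace
        System.out.println(countLinks(rss, null));       // RSS link has no namespace
    }
}
```

The link element in the XHTML page is found only when the XHTML namespace is requested, while the link element in the namespace-free feed is found only when null is passed.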

  • Note:
    • So far, the DocumentBuilder class has been used only to parse existing documents, but it also provides a newDocument() method that can be called to create an empty Document object.
    • DOM API methods, such as createElement() or createElementNS(), can then be called to construct an internal representation of an XML document.
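
As a minimal sketch of this approach, the following code builds a small RSS-like tree from scratch instead of parsing a file. The class name BuildDocument, the method buildFeed(), and the choice of elements are assumptions made for illustration.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical helper: constructs a DOM tree in memory without parsing any input. */
class BuildDocument {

    /** Builds an rss document whose channel holds one link element per URL. */
    static Document buildFeed(String... linkUrls) {
        DocumentBuilder parser;
        try {
            parser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Could not create DocumentBuilder", e);
        }

        // Start from an empty Document and attach the root element.
        Document document = parser.newDocument();
        Element rss = document.createElement("rss");
        rss.setAttribute("version", "0.91");
        document.appendChild(rss);

        Element channel = document.createElement("channel");
        rss.appendChild(channel);

        // Nodes are created by the Document and then attached to their parent.
        for (String url : linkUrls) {
            Element link = document.createElement("link");
            link.appendChild(document.createTextNode(url));
            channel.appendChild(link);
        }
        return document;
    }

    public static void main(String[] args) {
        Document document = buildFeed("http://www.example.com/", "http://www.example.org/");
        System.out.println("Root element: " + document.getDocumentElement().getTagName());
        System.out.println("Link elements: " + document.getElementsByTagName("link").getLength());
    }
}
```

A tree built this way can be queried with the same DOM methods as a parsed document, for example getElementsByTagName().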

 

  • A drawback of processing an XML document using the DOM approach is that the entire document tree must be created, even if only a fraction of the document is actually pertinent to the software processing it. For a task as simple as counting link elements, loading the entire tree will almost certainly use much more memory than is necessary, and will probably use more CPU time as well (for allocating memory, creating data structures, etc.).