XML Processing
- There are several standard approaches to processing XML documents in Java:
- approaches based on the Document Object Model (DOM)
- approaches that use other Java-based XML packages.
Figure 7.1 is an XML document that represents an RSS (Rich Site Summary) feed, a document designed to provide summary information about a Web site.
It is used as the input file to explain the parsing methods.
<!DOCTYPE rss
SYSTEM "http://my.netscape.com/publish/formats/rss-0.91.dtd">
<rss version="0.91">
<channel>
<title>www.example.com</title>
<link>http://www.example.com/</link>
<description>
www.example.com is not a site that changes often…
</description>
<language>en-us</language>
<item>
<title>Announcing a Sibling Site!</title>
<link>http://www.example.org/</link>
<description>
Were you aware that example.com is not the only site in the example family?
</description>
</item>
<item>
<title>We’re Up!</title>
<link>http://www.example.net/</link>
<description>
Our new RSS feed is up. Visit us today!
</description>
</item>
</channel>
</rss>
DOM-Based XML Processing
- In DOM-based XML processing, an XML document is first input and parsed, creating a tree of nodes representing elements, text, comments, and so on.
- After the tree has been constructed, DOM methods can be called to modify the tree, extract data from it, and so on.
- The Java DOM API is defined as part of the standard Java API and specifies a number of interfaces that correspond to DOM objects and classes, such as Node, Document, Element, and Text.
- Consider the Java program of Figure 7.7. It reads an XML document from a user-specified file and outputs the number of link elements contained in that document.
// JAXP classes
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.DocumentBuilder;
// DOM classes
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
// JDK classes
import java.io.File;
/** Count the number of link elements in an xml document */
class DOMCountLinks
{
/** Main program does it all */
public static void main(String[] args)
{
try
{
// JAXP-style initialization of DocumentBuilder
// (XML parser that builds DOM from document)
DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = docBuilderFactory.newDocumentBuilder();
// Parse XML document from given file into a DOM Document object
Document document = parser.parse(new File(args[0]));
// Process the Document object using the Java API version of the W3C DOM
NodeList links = document.getElementsByTagName("link");
System.out.println("Input document has " + links.getLength() + " 'link' elements.");
}
catch (Exception e)
{
e.printStackTrace();
}
return;
}
}
FIGURE 7.7 DOM-based program for displaying the number of links in an XML document.
- The program:
- opens the file specified by the first command-line argument,
- parses the XML document contained in this file,
- produces a Document object,
- uses the getElementsByTagName() method to retrieve a list of Node objects corresponding to the elements of type link in the XML document,
- counts and outputs the number of link elements in the XML document.
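The same steps can also be used to extract data from the tree rather than just count nodes. The following is a minimal sketch, not part of the original figure: the class name DOMItemTitles and the helper itemTitles() are hypothetical, and the document is parsed from a String (through an InputSource) instead of a file so that the example is self-contained.

```java
// JAXP classes
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
// DOM classes
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
// SAX class used only to wrap the in-memory text as parser input
import org.xml.sax.InputSource;
// JDK classes
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch: collect the title of each item in an RSS document */
class DOMItemTitles
{
    /** Parse the given XML text and return the title text of every item element */
    static List<String> itemTitles(String xml) throws Exception
    {
        DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder parser = docBuilderFactory.newDocumentBuilder();
        Document document = parser.parse(new InputSource(new StringReader(xml)));

        List<String> titles = new ArrayList<>();
        NodeList items = document.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++)
        {
            Element item = (Element) items.item(i);
            // Search only below this item, so the channel's own title is skipped
            NodeList itemTitleNodes = item.getElementsByTagName("title");
            if (itemTitleNodes.getLength() > 0)
            {
                titles.add(itemTitleNodes.item(0).getTextContent().trim());
            }
        }
        return titles;
    }

    public static void main(String[] args) throws Exception
    {
        // Shortened version of the feed in Figure 7.1 (DOCTYPE omitted to avoid a DTD fetch)
        String xml =
            "<rss version=\"0.91\"><channel>"
            + "<title>www.example.com</title>"
            + "<item><title>Announcing a Sibling Site!</title></item>"
            + "<item><title>Our feed is up</title></item>"
            + "</channel></rss>";
        for (String title : itemTitles(xml))
        {
            System.out.println(title);
        }
    }
}
```

Calling getElementsByTagName() on an Element rather than on the Document restricts the search to that element's descendants, which is why the channel title is not returned.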
DocumentBuilderFactory docBF = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = docBF.newDocumentBuilder();
- These two lines are the standard technique for obtaining a DOM-based parser in a Java program using Sun's Java API for XML Processing (JAXP).
- JAXP provides a unified approach to creating parser instances through a factory mechanism.
- (A factory is just an object that is used to create other objects.)
- In JAXP, a two-stage approach is used to create a parser.
- (1) The factory itself is created by a call to the static newInstance() method of DocumentBuilderFactory.
- (2) Once the factory instance has been created, it is used to create the actual DOM-based parser by a call to the factory’s newDocumentBuilder() method.
- By default, a DocumentBuilderFactory instance creates a parser that is nonvalidating and not namespace-aware. These defaults can be overridden by calling the methods setValidating(true) and setNamespaceAware(true), respectively, on the DocumentBuilderFactory instance before creating the DocumentBuilder instance.
- For example, the following code creates a parser that is nonvalidating but namespace-aware.
DocumentBuilderFactory docBF = DocumentBuilderFactory.newInstance();
docBF.setNamespaceAware(true);
DocumentBuilder parser = docBF.newDocumentBuilder();
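One way to confirm these settings is to ask the parser itself. The sketch below assumes nothing beyond the JAXP classes already used; the class name ParserDefaults and the helper describe() are hypothetical names introduced for illustration. It reports the settings of a parser created with the factory defaults and of one created after setNamespaceAware(true).

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

/** Illustrative sketch: report how a DocumentBuilder has been configured */
class ParserDefaults
{
    /** Summarize the parser's validation and namespace settings as text */
    static String describe(DocumentBuilder parser)
    {
        return "validating=" + parser.isValidating()
            + ", namespaceAware=" + parser.isNamespaceAware();
    }

    public static void main(String[] args) throws Exception
    {
        DocumentBuilderFactory docBF = DocumentBuilderFactory.newInstance();
        // Parser created with the factory defaults
        System.out.println("default:  " + describe(docBF.newDocumentBuilder()));
        // Settings changed on the factory apply to parsers created afterwards
        docBF.setNamespaceAware(true);
        System.out.println("modified: " + describe(docBF.newDocumentBuilder()));
    }
}
```

The settings belong to the factory, so they must be changed before newDocumentBuilder() is called; a parser that has already been created keeps the configuration it was built with.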
- The W3C DOM API includes methods that support namespace-aware processing:
- getElementsByTagNameNS(), which returns a NodeList for a given element type name. It takes two String arguments: a namespace name (URI) followed by a local name.
- Example
NodeList links = document.getElementsByTagNameNS(null, "link");
- Retrieves a NodeList containing all of the link elements in the document that belong to no namespace.
NodeList links = document.getElementsByTagNameNS("http://www.w3.org/1999/xhtml", "link");
- Retrieves a NodeList containing all of the link elements in a valid XHTML document that belong to the XHTML namespace http://www.w3.org/1999/xhtml (normally declared as the document's default namespace).
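The two calls can be compared side by side in a small program. This is a sketch under stated assumptions: the class name NamespaceLinkCount, the helper countLinks(), and the two short sample documents are hypothetical, and the documents are parsed from Strings so that no files or DTDs are needed.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import java.io.StringReader;

/** Illustrative sketch: count link elements by namespace name */
class NamespaceLinkCount
{
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Count link elements with the given namespace name (null means no namespace) */
    static int countLinks(String xml, String namespaceName) throws Exception
    {
        DocumentBuilderFactory docBF = DocumentBuilderFactory.newInstance();
        // Namespace-aware parsing is required for the NS methods to see namespace names
        docBF.setNamespaceAware(true);
        DocumentBuilder parser = docBF.newDocumentBuilder();
        Document document = parser.parse(new InputSource(new StringReader(xml)));
        return document.getElementsByTagNameNS(namespaceName, "link").getLength();
    }

    public static void main(String[] args) throws Exception
    {
        // Hypothetical sample inputs: one XHTML fragment, one RSS fragment
        String xhtml =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title>t</title>"
            + "<link rel=\"stylesheet\" href=\"s.css\"/></head><body/></html>";
        String rss =
            "<rss version=\"0.91\"><channel>"
            + "<link>http://www.example.com/</link></channel></rss>";

        System.out.println("XHTML, XHTML namespace: " + countLinks(xhtml, XHTML_NS));
        System.out.println("XHTML, no namespace:    " + countLinks(xhtml, null));
        System.out.println("RSS, no namespace:      " + countLinks(rss, null));
    }
}
```

The link element in the XHTML sample is found only when the XHTML namespace name is supplied, while the link element in the RSS sample, which declares no namespace, is found only with null.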
- Note:
- Although the DocumentBuilder class has only been used to parse existing documents, it also provides a method newDocument() that can be called in order to create an empty Document object.
- DOM API methods, such as createElement() or createElementNS(), can then be called to construct an internal representation of an XML document.
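As a rough illustration of this note, the sketch below builds a small RSS-like tree in memory starting from newDocument(). The class name DOMBuildFeed, the helper buildFeed(), and its parameters are hypothetical; only standard JAXP and DOM calls are used.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Illustrative sketch: construct a DOM tree without parsing any input */
class DOMBuildFeed
{
    /** Build a minimal rss/channel tree holding one title and one link */
    static Document buildFeed(String siteTitle, String siteLink) throws Exception
    {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Start from an empty Document rather than a parsed one
        Document document = builder.newDocument();

        Element rss = document.createElement("rss");
        rss.setAttribute("version", "0.91");
        document.appendChild(rss);

        Element channel = document.createElement("channel");
        rss.appendChild(channel);

        Element title = document.createElement("title");
        title.appendChild(document.createTextNode(siteTitle));
        channel.appendChild(title);

        Element link = document.createElement("link");
        link.appendChild(document.createTextNode(siteLink));
        channel.appendChild(link);

        return document;
    }

    public static void main(String[] args) throws Exception
    {
        Document document = buildFeed("www.example.com", "http://www.example.com/");
        System.out.println("Root element: " + document.getDocumentElement().getTagName());
        System.out.println("Link elements: " + document.getElementsByTagName("link").getLength());
    }
}
```

Nodes are created by the Document that will own them and only become part of the tree once appendChild() attaches them to a parent.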
- A drawback of processing an XML document using the DOM approach is that the entire document tree must be created, even if only a fraction of the document is actually pertinent to the software processing the document. Loading the entire tree for a task as simple as counting link elements will almost certainly use much more memory than is necessary, and will probably use more CPU time as well (for allocating memory, creating data structures, etc.).