![]() | |
|
The Java API for XML Processing (JAXP) is for processing XML data using applications written in the Java programming language. JAXP leverages the parser standards Simple API for XML Parsing (SAX) and Document Object Model (DOM) so that you can choose to parse your data as a stream of events or to build an object representation of it. JAXP also supports the Extensible Stylesheet Language Transformations (XSLT) standard, giving you control over the presentation of the data and enabling you to convert the data to other XML documents or to other formats, such as HTML. JAXP also provides namespace support, allowing you to work with DTDs that might otherwise have naming conflicts.
Streaming API for XML (StAX) provides is the latest API in the JAXP family.
The main JAXP APIs are defined in the javax.xml.parsers package. That package contains vendor-neutral factory classes - SAXParserFactory, DocumentBuilderFactory, and TransformerFactory -which give you a SAXParser, a DocumentBuilder, and an XSLT transformer, respectively. DocumentBuilder, in turn, creates a DOM-compliant Document object.
The factory APIs let you plug in an XML implementation offered by another vendor without changing your source code. The implementation you get depends on the setting of the javax.xml.parsers.SAXParserFactory, javax.xml.parsers.DocumentBuilderFactory, and javax.xml.transform.TransformerFactory system properties, using System.setProperties() in the code, or -DpropertyName="..." on the command line. The default values (unless overridden at runtime on the command line or in the code) point to Sun's implementation.
The Simple API for XML (SAX) APIs
The basic outline of the SAX parsing APIs are shown below. To start the process, an instance of the SAXParserFactory class is used to generate an instance of the parser.
The parser wraps a SAXReader object. When the parser's parse() method is invoked, the reader invokes one of several callback methods implemented in the application. Those methods are defined by the interfaces ContentHandler, ErrorHandler, DTDHandler, and EntityResolver.
Here is a summary of the key SAX APIs:
SAXParserFactory
A SAXParserFactory object creates an instance of the parser determined by the system property, javax.xml.parsers.SAXParserFactory.
SAXParser
The SAXParser interface defines several kinds of parse() methods. In general, you pass an XML data source and a DefaultHandler object to the parser, which processes the XML and invokes the appropriate methods in the handler object.
public static void main(String argv[]) { // Use an instance of ourselves as the SAX event handler DefaultHandler handler = new Echo(); // Use the default (non-validating) parser SAXParserFactory factory = SAXParserFactory.newInstance(); try { // Set up output stream out = new OutputStreamWriter(System.out, "UTF8"); // Parse the input SAXParser saxParser = factory.newSAXParser(); saxParser.parse(new File(argv[0]), handler); } catch (Throwable t) { t.printStackTrace(); } }
SAXReader
The SAXParser wraps a SAXReader. Typically, you don't care about that, but every once in a while you need to get hold of it using SAXParser's getXMLReader() so that you can configure it. It is the SAXReader that carries on the conversation with the SAX event handlers you define.
DefaultHandler
Not shown in the diagram, a DefaultHandler implements the ContentHandler, ErrorHandler, DTDHandler, and EntityResolver interfaces (with null methods), so you can override only the ones you're interested in.
ContentHandler
Methods such as startDocument, endDocument, startElement, and endElement are invoked when an XML tag is recognized. This interface also defines the methods characters and processingInstruction, which are invoked when the parser encounters the text in an XML element or an inline processing instruction, respectively.
ErrorHandler
Methods error, fatalError, and warning are invoked in response to various parsing errors. The default error handler throws an exception for fatal errors and ignores other errors (including validation errors). That's one reason you need to know something about the SAX parser, even if you are using the DOM. Sometimes, the application may be able to recover from a validation error. Other times, it may need to generate an exception. To ensure the correct handling, you'll need to supply your own error handler to the parser.
DTDHandler
Defines methods you will generally never be called upon to use. Used when processing a DTD to recognize and act on declarations for an unparsed entity.
EntityResolver
The resolveEntity method is invoked when the parser must identify data identified by a URI. In most cases, a URI is simply a URL, which specifies the location of a document, but in some cases the document may be identified by a URN - a public identifier, or name, that is unique in the web space. The public identifier may be specified in addition to the URL. The EntityResolver can then use the public identifier instead of the URL to find the document - for example, to access a local copy of the document if one exists.
A typical application implements most of the ContentHandler methods, at a minimum. Because the default implementations of the interfaces ignore all inputs except for fatal errors, a robust implementation may also want to implement the ErrorHandler methods.
When to Use SAX
SAX is fast and efficient, but its event model makes it most useful for such state-independent filtering. For example, a SAX parser calls one method in your application when an element tag is encountered and calls a different method when text is found. If the processing you're doing is state-independent (meaning that it does not depend on the elements have come before), then SAX works fine.
On the other hand, for state-dependent processing, where the program needs to do one thing with the data under element A but something different with the data under element B, then a pull parser such as the Streaming API for XML (StAX) would be a better choice. With a pull parser, you get the next node, whatever it happens to be, at any point in the code that you ask for it. So it's easy to vary the way you process text (for example), because you can process it multiple places in the program.
SAX requires much less memory than DOM, because SAX does not construct an internal representation (tree structure) of the XML data, as a DOM does. Instead, SAX simply sends data to the application as it is read; your application can then do whatever it wants to do with the data it sees.
Pull parsers and the SAX API both act like a serial I/O stream. You see the data as it streams in, but you can't go back to an earlier position or leap ahead to a different position. In general, such parsers work well when you simply want to read data and have the application act on it.
But when you need to modify an XML structure - especially when you need to modify it interactively - an in-memory structure makes more sense. DOM is one such model. However, although DOM provides many powerful capabilities for large-scale documents (like books and articles), it also requires a lot of complex coding.
The Document Object Model (DOM) APIs
Figure below shows the DOM APIs in action:
You use the javax.xml.parsers.DocumentBuilderFactory class to get a DocumentBuilder instance, and you use that instance to produce a Document object that conforms to the DOM specification. The builder you get, in fact, is determined by the system property javax.xml.parsers.DocumentBuilderFactory, which selects the factory implementation that is used to produce the builder.
static Document document; public static void main(String argv[]) { DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance(); try { DocumentBuilder builder = factory.newDocumentBuilder(); document = builder.parse( new File(argv[0]) ); } catch (SAXParseException spe) { ... } }
You can also use the DocumentBuilder.newDocument() method to create an empty Document that implements the org.w3c.dom.Document interface. Alternatively, you can use one of the builder's parse methods to create a Document from existing XML data.
public static void buildDom() { DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance(); try { DocumentBuilder builder = factory.newDocumentBuilder(); document = builder.newDocument(); Element root = (Element) document.createElement("rootElement"); document.appendChild(root); root.appendChild(document.createTextNode("Some")); root.appendChild(document.createTextNode(" ")); root.appendChild( document.createTextNode("text") ); } catch (ParserConfigurationException pce) { // Parser with specified options can't be built pce.printStackTrace(); } }
When to Use DOM
DOM is platform-neutral and language-neutral API that enables programs to dynamically update the contents of XML documents.
DOM creates an in-memory object representation of an entire XML document. This allows extreme flexibility in parsing, navigating, and updating the contents of a document. DOM's drawbacks are high memory requirements and the need for more powerful processing capabilities.
The Extensible Stylesheet Language Transformations (XSLT) APIs
Figure below shows the XSLT APIs in action:
A TransformerFactory object is instantiated and used to create a Transformer. The source object is the input to the transformation process. A source object can be created from a SAX reader, from a DOM, or from an input stream.
Similarly, the result object is the result of the transformation process. That object can be a SAX event handler, a DOM, or an output stream.
When the transformer is created, it can be created from a set of transformation instructions, in which case the specified transformations are carried out. If it is created without any specific instructions, then the transformer object simply copies the source to the result.
Here is a description of the packages that make up the JAXP Transformation APIs:
javax.xml.transform
This package defines the factory class you use to get a Transformer object. You then configure the transformer with input (source) and output (result) objects, and invoke its transform() method to make the transformation happen. The source and result objects are created using classes from one of the other three packages.
javax.xml.transform.dom
Defines the DOMSource and DOMResult classes, which let you use a DOM as an input to or output from a transformation.
Document document; try { File f = new File(argv[0]); DocumentBuilder builder = factory.newDocumentBuilder(); document = builder.parse(f); // Use a Transformer for output TransformerFactory tFactory = TransformerFactory.newInstance(); Transformer transformer = tFactory.newTransformer(); DOMSource source = new DOMSource(document); StreamResult result = new StreamResult(System.out); transformer.transform(source, result); }
javax.xml.transform.sax
Defines the SAXSource and SAXResult classes, which let you use a SAX event generator as input to a transformation, or deliver SAX events as output to a SAX event processor.
public class AddressBookReader implements XMLReader {...}
// Create the sax "parser" AddressBookReader saxReader = new AddressBookReader(); Transformer transformer = tFactory.newTransformer(); File f = new File(argv[0]); // Use the parser as a SAX source for input FileReader fr = new FileReader(f); BufferedReader br = new BufferedReader(fr); InputSource inputSource = new InputSource(br); SAXSource source = new SAXSource(saxReader, inputSource); StreamResult result = new StreamResult(System.out); transformer.transform(source, result);
javax.xml.transform.stream
Defines the StreamSource and StreamResult classes, which let you use an I/O stream as an input to or output from a transformation.
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance(); //factory.setNamespaceAware(true); //factory.setValidating(true); try { File stylesheet = new File(argv[0]); File datafile = new File(argv[1]); DocumentBuilder builder = factory.newDocumentBuilder(); document = builder.parse(datafile); // Use a Transformer for output TransformerFactory tFactory = TransformerFactory.newInstance(); StreamSource stylesource = new StreamSource(stylesheet); Transformer transformer = tFactory.newTransformer(stylesource); DOMSource source = new DOMSource(document); StreamResult result = new StreamResult(System.out); transformer.transform(source, result); }
StAX API
The StAX API exposes methods for iterative, event-based processing of XML documents. XML documents are treated as a filtered series of events, and infoset states can be stored in a procedural fashion. Moreover, unlike SAX, the StAX API is bidirectional, meaning that it can both read and write XML documents.
The StAX API is really two distinct API sets: a cursor API and an iterator API.
Cursor API
As the name implies, the StAX cursor API represents a cursor with which you can walk an XML document from beginning to end. This cursor can point to one thing at a time, and always moves forward, never backward, usually one infoset element at a time.
The two main cursor interfaces are XMLStreamReader and XMLStreamWriter. XMLStreamReader includes accessor methods for all possible information retrievable from the XML Information model, including document encoding, element names, attributes, namespaces, text nodes, start tags, comments, processing instructions, document boundaries, and so forth; for example:
public interface XMLStreamReader { public int next() throws XMLStreamException; public boolean hasNext() throws XMLStreamException; public String getText(); public String getLocalName(); public String getNamespaceURI(); // ... other methods not shown }
A typical StAX program begins by using the XMLInputFactory class to load an implementation dependent instance of XMLStreamReader:
URL u = new URL("http://www.example.com/data.xml"); InputStream in = u.openStream(); XMLInputFactory factory = XMLInputFactory.newInstance(); XMLStreamReader parser = factory.createXMLStreamReader(in);The next() method in XMLStreamReader advances the cursor to the next item. Various getter methods to extract data from the current item. Some of the most important of these getters include: getName(), getLocalName(), getText(), etc.
For example, here's a simple bit of code that iterates through an XML document and prints out the names of the different elements it encounters:
while (true) { int event = parser.next(); if (event == XMLStreamConstants.END_DOCUMENT) { parser.close(); break; } if (event == XMLStreamConstants.START_ELEMENT) { System.out.println(parser.getLocalName()); } }
You can call methods on XMLStreamReader, such as getText and getName, to get data at the current cursor location. XMLStreamWriter provides methods that correspond to StartElement and EndElement event types; for example:
public interface XMLStreamWriter { public void writeStartElement(String localName) throws XMLStreamException; public void writeEndElement() throws XMLStreamException; public void writeCharacters(String text) throws XMLStreamException; // ... other methods not shown }
This interface provides methods to write elements, attributes, comments, text, and all the other parts of an XML document. An XMLStreamWriter is created by an XMLOutputFactory like this:
OutputStream out = new FileOutputStream("data.xml"); XMLOutputFactory factory = XMLOutputFactory.newInstance(); XMLStreamWriter writer = factory.createXMLStreamWriter(out);
You write data onto the stream by using various writeXXX methods: writeStartDocument, writeStartElement, writeEndElement, writeCharacters, writeComment, writeCDATA, etc. For example, these lines of code write a simple document:
writer.writeStartDocument("ISO-8859-1", "1.0"); writer.writeStartElement("greeting"); writer.writeAttribute("id", "g1"); writer.writeCharacters("Hello StAX"); writer.writeEndDocument();
The cursor API mirrors SAX in many ways. For example, methods are available for directly accessing string and character information, and integer indexes can be used to access attribute and namespace information. As with SAX, the cursor API methods return XML information as strings, which minimizes object allocation requirements.
Iterator API
The StAX iterator API represents an XML document stream as a set of discrete event objects. These events are pulled by the application and provided by the parser in the order in which they are read in the source XML document.
The base iterator interface is called XMLEvent, and there are subinterfaces for each event type (StartDocument, StartElement, EndElement, etc.). The primary parser interface for reading iterator events is XMLEventReader, and the primary interface for writing iterator events is XMLEventWriter. The XMLEventReader interface contains five methods, the most important of which is nextEvent(), which returns the next event in an XML stream. XMLEventReader extends java.util.Iterator, which means that returns from XMLEventReader can be cached or passed into routines that can work with the standard Java Iterator; for example:
public interface XMLEventReader extends Iterator { public XMLEvent nextEvent() throws XMLStreamException; public boolean hasNext(); public XMLEvent peek() throws XMLStreamException; // ... other methods not shown }
Similarly, on the output side of the iterator API, you have:
public interface XMLEventWriter { public void flush() throws XMLStreamException; public void close() throws XMLStreamException; public void add(XMLEvent e) throws XMLStreamException; public void add(Attribute attribute) throws XMLStreamException; // ... other methods not shown }
Comparing Cursor and Iterator APIs
Before choosing between the cursor and iterator APIs, you should note a few things that you can do with the iterator API that you cannot do with cursor API:
Objects created from the XMLEvent subclasses are immutable, and can be used in arrays, lists, and maps, and can be passed through your applications even after the parser has moved on to subsequent events.
You can create subtypes of XMLEvent that are either completely new information items or extensions of existing items but with additional methods.
You can add and remove events from an XML event stream in much simpler ways than with the cursor API.
Similarly, keep some general recommendations in mind when making your choice:
If you are programming for a particularly memory-constrained environment, like J2ME, you can make smaller, more efficient code with the cursor API.
If performance is your highest priority - for example, when creating low-level libraries or infrastructure - the cursor API is more efficient.
If you want to create XML processing pipelines, use the iterator API.
If you want to modify the event stream, use the iterator API.
If you want to your application to be able to handle pluggable processing of the event stream, use the iterator API.
In general, if you do not have a strong preference one way or the other, using the iterator API is recommended because it is more flexible and extensible, thereby "future-proofing" your applications.
When to Use StAX
StAX's bidirectional features, small memory footprint, and low processor requirements give it an advantage over APIs such as JAXB or DOM. StAX is particularly effective in extracting a small set of information from a large document. The primary drawback in using StAX is that you get a narrow view of the document - essentially you have to know what processing you will do before reading the XML document. Another drawback is that StAX is difficult to use if you return XML documents that follow complex schema.
![]() ![]() ![]() ![]() ![]() ![]() |
![]() |
![]() |