eXploring What's neXt: XML, Information Sciences, Libraries and Markup Technology


John Robert Gardner

ATLA-CERTR

Emory University

Since becoming a formal recommendation of the World Wide Web Consortium in February 1998, XML has been described as everything from "dumbed-down SGML," "a panacea of data solutions," and "the true promise of what the web was meant to be" to "vapor ware" and "assimilation by the Borg" (with apologies to Star Trek). The truth of the matter is a combination of these.

What is XML?

XML (Extensible Markup Language) is one of many "ML’s," or markup languages. A markup language is basically a set of tags, or identifiers, that are added to a given text or data source. A book, essay, or other piece of digital information is, in effect, "marked up" with these tags. This is the principle behind SGML (Standard Generalized Markup Language) and the better-known HTML (Hypertext Markup Language).

The tags these markup languages use are names enclosed in angle brackets, that is, less-than and greater-than symbols. The tags, loosely also called "elements," are used to indicate a portion of a document which can be described by a general name, such as one to designate an author:

<author>Coleridge</author>

Tags generally appear immediately before and immediately after whatever it is that they are marking. They are usually visible only to the computer or processing software. If it sounds simple, it really is. This basic format has remained unchanged for over 15 years and shows no sign of changing.

This simple syntax characterizes most of SGML, HTML, and also XML. Obviously the tags can be more complex, and contain more information, such as attributes:

<author name="last" more="Samuel T.">Coleridge</author>

In this example, "name" is an attribute whose value for this occurrence of "Coleridge" is "last." Together, the tags (or elements) and the attributes in this example say: Coleridge, an author, is referenced here by his last name, but can also be known as Samuel T. Coleridge.
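To make the anatomy of the example concrete, here is a minimal sketch that reads the same tag with Python's standard xml.etree.ElementTree module. The choice of Python and the variable names are assumptions made for illustration; any XML-aware tool would report the same three pieces.

```python
import xml.etree.ElementTree as ET

# The attribute example from the text, parsed as a one-element document.
element = ET.fromstring('<author name="last" more="Samuel T.">Coleridge</author>')

tag_name = element.tag             # the element (tag) name
attributes = dict(element.attrib)  # attribute names mapped to their values
content = element.text             # the text that the tags mark up

print(tag_name, attributes, content)
```

The parser separates the name of the tag, its attributes, and the marked-up text, which is exactly the information the tag was added to convey.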

XML works the same way, with several important specifics. Any tag which begins (e.g., "<author>") must also end, or be "closed" (e.g., "</author>"). Tag names are case-sensitive: lower case is customary rather than required, but whatever capitalization is used in the starting tag must match exactly in the ending tag. Any attribute values, such as "last" in name="last" above, must be enclosed in quote marks.

Also, tags must nest within each other:

<author><bold>Coleridge</bold></author>

is okay, but

<author><bold>Coleridge</author></bold>

is not. There are a few other details, but this is the core of the rules which make XML files better structured and easier to process, and to preserve, than HTML or even SGML files.
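These rules can be checked mechanically. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree parser; the helper name is_well_formed and the sample strings beyond the two nesting examples are hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if `text` parses as well-formed XML, otherwise False."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# The two nesting examples from the text.
properly_nested = "<author><bold>Coleridge</bold></author>"
badly_nested = "<author><bold>Coleridge</author></bold>"

# Two further illustrative violations: mismatched capitalization
# and an attribute value without quote marks.
case_mismatch = "<Author>Coleridge</author>"
unquoted_attribute = "<author name=last>Coleridge</author>"

for sample in (properly_nested, badly_nested, case_mismatch, unquoted_attribute):
    print(is_well_formed(sample), sample)
```

Only the first sample is accepted. A conforming XML parser refuses the other three outright rather than guessing at what was meant, which is what makes XML files predictable to process.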

Why is XML Important?

Building upon these strict rules of syntax, XML is both robust and versatile. It is important not only because of the archival integrity those rules provide, but also because of the wide range of related technologies with which it is compatible. XML is fully compatible with existing SGML technology, and it is fully workable with any Z39.50 information interchange system.

XML is at the hub of a set of information processing standards built around basic principles of ease in interchanging and accessing data, archivability, and affordability of processing software. It includes advanced linking capabilities that enable associated material for one query or search to be accessed from several different places and then embedded or included in the original document returned by the query.

Best of all, XML’s strict syntax rules enable it to be easily converted to other formats. For instance, going from MARC to XML and back again is simple. Accordingly, working with XML does not become a liability in the event of further changes in technology years down the road. In fact, it is insurance that the format of your data will not become obsolete. XML is important as a conduit of present and past formats through the future of digital data technology.

How is XML different from HTML and SGML?

HTML does not have the capacity to represent such detailed information as in the "author" example above. This detail, often called "semantic detail" or "semantic markup," has, however, been effectively conveyed for the past decade or so by SGML. Nonetheless, the technology to work with SGML has remained so difficult to learn and expensive to deploy that information scientists and institutions have had little or no time to spare for its steep learning curve and costly ongoing support.

To better understand XML as a markup technology, it helps to characterize it with respect to its sibling, HTML, and its parent, SGML. SGML establishes the rules by which any system of markup, such as HTML, can be created. Thus the various forms of markup such as XML and HTML are descended from SGML (in fact, you could characterize HTML as a prodigal son to SGML, with XML as the wiser son; see http://vedavid.org/xml/luke/ for a "marked-up" version of the well-known parable).

HTML is not good for data storage because it only describes what information should look like, not what it actually is. This focus on format in HTML tags is also problematic because the code can be inconsistent, even haphazard. For example, paragraphs with starting "<p>" tags frequently do not have closing, or ending "</p>" tags. Sometimes HTML attributes have quote marks, sometimes they don’t. Sometimes tags will be capitalized, others won’t be. This is why, for good reason, HTML has not been recognized as an archivable or data-worthy standard.

SGML has always offered a viable contrast to HTML. Its widely-known academic application in the form of the Text Encoding Initiative (TEI) tag set has been a staple of projects with electronic texts for nearly a decade. TEI is like other markup tag sets, such as MathML, in that it specifically addresses the needs of particular disciplinary groups. These tag sets are defined by means of a detailed, code-like file which establishes rules for which tags are to be included. These rules also specify where and how tags are included for the overall structure of any given document that conforms to them. These sets of rules are called Document Type Definitions, or DTD’s.

SGML files require these DTD’s in order to be formatted, processed, and, especially, preserved for archival integrity. XML can also use DTD’s, but it can be read and interpreted without them. A related standard, XML Schema, enables an XML file’s data structure to be read and interpreted even without a DTD. For concerns of archiving and institutional memory, this feature is a key innovation over SGML, and was never possible with HTML.
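The point that XML can be read without a DTD can be illustrated with a short sketch. The record below is a hypothetical example with no DTD or schema attached, and the function name outline is likewise an assumption; the parser is Python's standard xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# A hypothetical record with no DTD or schema attached.
record = """<book>
  <author name="last">Coleridge</author>
  <title>Biographia Literaria</title>
</book>"""

root = ET.fromstring(record)

def outline(element, depth=0):
    """Return (depth, tag) pairs for an element and its descendants, in document order."""
    pairs = [(depth, element.tag)]
    for child in element:
        pairs.extend(outline(child, depth + 1))
    return pairs

structure = outline(root)
for depth, tag in structure:
    print("  " * depth + tag)
```

Because every element is explicitly opened and closed, the nesting of the data can be recovered from the file alone, with no external rule set consulted.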

How is XML used?

How XML is used goes hand in hand with why it is important. XML is already widely deployed in the electronic commerce sector because of how well it enables data to be exchanged across different systems and different continents. XML is now a fundamental part of products from the major software providers for the average user: WordPerfect 9, the forthcoming Microsoft Office 2000, and Adobe products, for instance.

XML shares the same archival characteristics as MARC and SGML at the character data level. In other words, all three are based on simple character data rather than proprietary or commercially-owned specialized codes or figures. This characteristic is part of what is called "flat data" in archival (as opposed to structural) terms, and it is recognized as suitable for data storage by the U.S. Government’s National Archives and Records Administration (http://www.nara.gov/records/; cf. additionally gopher://gopher.nara.gov:70/00/about/cfr/records/1228.txt, and "Preserving Digital Information," the report of the Task Force on Archiving of Digital Information, commissioned by The Commission on Preservation and Access and the Research Libraries Group, Inc., http://www.rlg.org/ArchTF/tfadi.index.htm). Flat data is also the format chosen for the Oxford English Dictionary and the e-Text Library, Special Collections and Rare Books at the University of Virginia (http://etext.lib.virginia.edu).

More important for users of MARC, XML offers a viable conduit for making an existing MARC database interoperable with other formats such as Dublin Core or GILS. XML’s strict syntax, in effect, serves the same function as the MARC directory. Where a directory entry specifies a tag identity, such as "100," as well as the field length and starting point in the record’s sequence of characters, XML’s required start and end tags identify each field and mark exactly where its data begins and ends.
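The parallel between the directory and XML tags can be sketched in a few lines of Python. A MARC directory entry is 12 characters: a 3-character tag, a 4-digit field length, and a 5-digit starting position. Everything else here is a simplifying assumption made for illustration: the data area omits indicators, subfield codes, and field terminators, and the element name "field" and the function name parse_directory_entry are hypothetical.

```python
import xml.etree.ElementTree as ET

def parse_directory_entry(entry):
    """Split one 12-character directory entry into (tag, length, start)."""
    if len(entry) != 12:
        raise ValueError("a MARC directory entry is 12 characters long")
    return entry[0:3], int(entry[3:7]), int(entry[7:12])

# Simplified, hypothetical data area: two fields stored back to back.
data = "Coleridge, Samuel TaylorBiographia Literaria"
directory = ["100002400000", "245002000024"]

record = ET.Element("record")
for entry in directory:
    tag, length, start = parse_directory_entry(entry)
    field = ET.SubElement(record, "field", tag=tag)
    # The directory says where the field lives; the XML tags will say it in place.
    field.text = data[start:start + length]

xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

In the MARC form, the boundaries of each field are known only by consulting the directory; in the XML form, the start and end tags carry that same information alongside the data itself.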

What Skills and Tools are Necessary to Work with XML?

Most of the skills that an information sciences professional already has at his/her disposal will translate directly into the skillset for XML work. Fundamentally, it is not the technology that is challenging so much as understanding the data itself. Where MARC, for instance, uses "245" for a title or "651" for a geographical keyword, XML simply uses descriptive (for the most part) terminology to represent the same material.
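As a rough illustration of that difference, the sketch below replaces the two MARC tag numbers mentioned above with descriptive element names. The names "title" and "geographic-keyword," the fallback element "field," and the sample values are all hypothetical; a real project would take its names from an agreed tag set.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names; "245" and "651" are the MARC tags named in the text.
MARC_TO_NAME = {"245": "title", "651": "geographic-keyword"}

def describe(fields):
    """Build a <record> whose children use descriptive names instead of MARC numbers."""
    record = ET.Element("record")
    for tag, value in fields:
        child = ET.SubElement(record, MARC_TO_NAME.get(tag, "field"))
        if tag not in MARC_TO_NAME:
            child.set("tag", tag)  # keep the number when no descriptive name is known
        child.text = value
    return record

record = describe([("245", "Biographia Literaria"), ("651", "England")])
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

The cataloguing decision (this string is a title, that one a place) is the same in both systems; only the label changes from a number to a word.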

Obviously the tools for entering XML data and those for something such as MARC will be different. However, the interface for library sciences can largely be the same. One is still categorizing authors, editors, titles, etc. There is also a whole range of XML software for library scientists, such as converters from MARC to XML (e.g., Bob Pritchett’s "marcxml.exe," http://www.logos.com/marc/marcxml.htm), and even live online converters from MARC to XML-compliant Dublin Core at the site of the CORC (Cooperative Online Resource Catalog) project (http://corc.oclc.org/; the password is preset, and it’s free).

For editing XML, there is a vast array of software, much of the best of which is free (cf. http://www.oasis-open.org/ or http://www.xmlsoftware.com). For less than $100, more elaborate packages are available. One such tool which is quite handy comes as part of the new WordPerfect Office 2000 (http://www.corel.com/). For the Macintosh there is Media Design In-Progress’ XPublish and Emile (http://www.in-progress.com/). In addition, much of the processing and serving of XML can be done with free tools built on Internet standards and written in Java, Python, or Perl. Apache web servers support it as well.

If you are already committed to a high-end database system such as Oracle, there is a full set of XML tools that are free from Oracle’s technet (http://technet.oracle.com/tech/xml/info/plsxml/feedback.htm). A really exciting, Z39.50-resident system stores MARC, XML, and SGML in their native formats and can respond to searches in any of these formats with equal speed. This remarkable tool is called Structured Information Manager, or SIM, and is made by RMIT (http://www.mds.rmit.edu.au/).

How Should One Prepare to Move from HTML to XML?

The best tool for doing this, called HTML Tidy, is actually free and operates on almost any platform (http://www.w3.org/People/Raggett/tidy/). Tidy will take your HTML file and turn it into XML-compliant data with strict and proper syntax (called "well-formedness"). If you have large quantities of files, much of this can be automated depending on the relative consistency of your file structure.

One excellent tool for conversions is the related XML standard for transforming data formats, called Extensible Stylesheet Language for Transformations, or XSL/T. XSL/T tools can run on any platform (a ready-made tool, XT, for Windows and Unix is free at http://www.jclark.com/; and one which runs on Python, and thus on any platform, is from the 4Thought folks at http://FourThought.com). XSL/T is written according to XML rules and allows you to change all your data with only a relative handful of easy-to-use lines of code (see my forthcoming article at IJTS for an easy introduction: http://www.asiatica.org/publications/ijts/default.asp). The main thing to remember is that XML was designed with ease of conversion in mind.
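To give a feel for what such a transformation does, here is a minimal sketch in plain Python. It is a stand-in written with the standard xml.etree.ElementTree module, not XSL/T itself and not any of the tools named above; the rename rules and the function name transform are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical rename rules: old element name -> new element name.
RENAME = {"bold": "emphasis", "author": "creator"}

def transform(element):
    """Return a copy of `element` with names replaced per RENAME; all else is kept."""
    copy = ET.Element(RENAME.get(element.tag, element.tag), dict(element.attrib))
    copy.text = element.text
    copy.tail = element.tail
    for child in element:
        copy.append(transform(child))
    return copy

source = ET.fromstring('<author name="last"><bold>Coleridge</bold></author>')
result = ET.tostring(transform(source), encoding="unicode")
print(result)
```

A handful of rules applied uniformly to every element is the same idea an XSL/T stylesheet expresses declaratively: because the input is guaranteed to be well-formed, the rules can be applied to a whole collection without hand-checking each file.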

Conclusion

The "extensible" future is bright. XML and its suite of technologies are standards-based and standards-driven. No single company, not even Microsoft, controls the design. It is also an international standard, which means support and internetworking are built in. If it’s XML, it is network-worthy, archive-ready, and comparatively simple to use. It will actually enable you to do more with what you already have at less cost and with less learning overhead. XML enables you to configure and adapt what you have already been doing for both the present and the future without loss of granularity in detail, while still maintaining the option to accommodate and integrate future technological advances seamlessly.