XML resources for academic application and E-Textnology

Handout for 1999 MLA Paper:

"Digital Alchemy: Research, Writing, and Rendition-
Electronic Theses and Dissertations in XML"

John Robert Gardner
ATLA-CERTR
Emory University

All the links you need to know to get started are included below. A discussion of flat data, XML, and archiving follows, with a short example of XML tagging to conclude this handout.

Complete two-day tutorial as part of the Networked Digital Library of Theses and Dissertations (NDLTD) national workshop. The technology is easily grasped in these sessions, and, while the focus is on ETDs, the software and scripts which you will receive are easily adaptable. This workshop takes a dissertation from its current state of writing in a word processor through conversion to XML, writing in XML, submission, conversion to multiple formats, and generation of library MARC and other archiving records. The conference, in turn, augments this with administrative discussions:

http://etd.eng.usf.edu/Conference/tprogram.htm
When they say 'Oasis', it is. It's a real one-stop shop. I regularly peruse the "what's new" section:

http://www.oasis-open.org
They also have a specific set of applications of XML (and SGML) in academia which reads as a digital who's who (The Chronicle's InfoTech section would be green . . . . ):

http://www.oasis-open.org/cover/acadapps.html
XML.com has great reviews of software (you can add your own) and a fully annotated, interactive section which explains XML in detail:

http://www.xml.com
The "new kid on the block," but a well-seasoned one with the pedigree of OASIS. Check:

http://www.xml.org
Perhaps the best sorting and arrangement of the latest in XML software, James Tauber's site has links to other great sites too:

http://www.xmlsoftware.com/
More links and a few resources from the Iowa project:

http://vedavid.org/xml/

XML is Flat Data for ETD Longevity and Access

Flat data is essential to the integrity of library, information, and archival considerations, and has been canonized as the fundamental reliable format by key organizations in the information management world. It meets multiple criteria:

Institutional memory (e.g., archives)--XML meets the U.S. Government's digital and archival records requirements. Expressed simply, this is "flat data"--files can be read, and the data within them is accessible, without specialized software.
1. Favored by the National Archives and Records Administration (NARA) Electronic Records Work Group, http://www.nara.gov/records/. Cf. the additional preservation report, gopher://gopher.nara.gov:70/00/about/cfr/records/1228.txt
2. "Preserving Digital Information," the report of the Task Force on Archiving of Digital Information, commissioned by The Commission on Preservation and Access, and the Research Libraries Group, Inc., http://www.rlg.org/ArchTF/tfadi.index.htm
3. Flat data is the format chosen for the Oxford English Dictionary and the e-Text Library, Special Collections and Rare Books at the University of Virginia, http://etext.lib.virginia.edu
eXtensible Markup Language (XML) is designed for compatibility with existing SGML resources (many of which are research resources likely to be included in theses and dissertations anyway). In addition, this compatibility is ensured for the future (it's even Y2K "compliant") because XML's structural consistency and integrity (known technically as "well-formedness") make conversion to other formats simple; see http://www.w3.org/TR/REC-xml and http://www.textuality.com/xml/faq.html
XML documents can meet international, commercial, and a growing host of academic disciplinary standards, while still remaining expandable. Thus XML is ideally suited for dissertation and thesis work, which, by its nature, explores new areas. Accordingly, XML allows scholars to add new identifiers and tags particular to their discipline. Several disciplinary standards already exist:
1. Mathematical Markup Language (MathML), http://www.w3.org/Math/
2. Chemical Markup Language (CML), http://sun01.iigb.na.cnr.it/omf/cml-1.0/doc/faq/
3. Standards for DNA research, Theology, Asian MSS, newswriting, commerce and business, as well as a multitude of commercial applications, are under development. See http://www.oasis-open.org/cover/ for the latest news.
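
The "well-formedness" mentioned above can be checked mechanically with nothing more than a standard XML parser. What follows is only a minimal sketch in Python, assuming the standard-library xml.etree.ElementTree module. The <manuscript> tag is a made-up, discipline-specific identifier used purely for illustration, and is_well_formed is a hypothetical helper name, not part of any standard.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# A hypothetical discipline-specific tag (<manuscript>) next to a <title> tag.
good = "<p>See <manuscript>MS 42</manuscript> and <title>Persuasione e rettorica</title>.</p>"
# The same kind of text with overlapping tags, which XML does not allow.
bad = "<p>See <manuscript>MS 42 and <title>Persuasione</manuscript> e rettorica</title>.</p>"

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```

Note that the parser accepts the new <manuscript> tag without being told about it in advance; well-formedness only asks that tags be properly opened, nested, and closed.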

Based on these considerations, our pilot builds upon the existing standard of the Text Encoding Initiative (TEI, Lite version), which was designed for the particular academic needs of preservation of, and access to, electronic data. The TEI standard is an implementation of the international standard called Standard Generalized Markup Language (SGML), ISO 8879, of which HTML (Hypertext Markup Language) is also an application. All three systems can be reviewed at the World Wide Web Consortium site for electronic information resources, http://www.w3.org. For more detailed annotated information on XML, SGML, and TEI, see the OASIS site, http://www.oasis-open.org, and http://www.xml.com. For software resources, see James Tauber's site, http://www.xmlsoftware.com/.

Simple Example of XML vs. other tagging like HTML

In short, look at this sentence:

In Carlo Michelstaedter's Persuasione e rettorica, there is a highly original
treatment of modernity and extremes of fin de siècle angst.

We see three italicized text portions:

One portion, "Persuasione e rettorica," uses italics to indicate "title."
Another portion, "highly original," uses italics for "emphasis."
And finally, we have "fin de siècle," where italics indicate "foreign" language.

Conventional word processing has made such ambiguity "standard." Anyone who has had to reformat a document from one publisher--or software--to another knows how frustrating this can be. XML makes the identification of information much more specific. So, in the example above (which is done in normal HTML), the phrases have really only been marked (in what the computer reads) with ambiguous italics:

<i> highly original </i>

This kind of marking assumes every reader knows what italicizing signifies, and that all computers can read it (which, we all know, is often not the case!). With XML we would use a different marking that tells the computer what each part is (in much the same way you'd tell it to italicize something):

there would be an "emphasis" identifier (you may notice it looks a lot like HTML style):
<emph> highly original </emph>
there would be a "title" identifier:
<title> Persuasione e rettorica </title>
there would be a "foreign" identifier:
<foreign> fin de siècle </foreign>

Aside from the obvious increase in precision, there are other advantages. For instance, anything you ever write using these tags (or ones like them for your discipline) can be reformatted--however many documents you've done this way--simply by telling one file to "make all titles italic, all foreign underlined," etc. You never have to re-format the content itself, just tell your computer what you want it to do with all "title" parts, or "emphasis" parts, etc. Thus your original content is always "safe" from later re-formatting. You don't have to risk damaging your composition just to change its rendition. Plus, XML takes up less disk space than most word-processing documents, it's Year 2000 safe, and ANYONE can read it on their computer without special software. In other words, in TEOTWAWKI ("the end of the world as we know it") XML is a safe format.
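
To make the "tell one file what to do with all titles" idea concrete, here is a rough sketch in Python, assuming only the standard library. It is not the pilot's actual software: the wrapping <p> element, the two style tables, and the render function are all hypothetical names introduced for illustration. The tagged sentence stays the same; only the small style table changes.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The sample sentence, tagged by meaning rather than by appearance.
# The wrapping <p> element is an assumption, added so the text has one root.
SENTENCE = (
    "<p>In Carlo Michelstaedter's <title>Persuasione e rettorica</title>, "
    "there is a <emph>highly original</emph> treatment of modernity and "
    "extremes of <foreign>fin de siècle</foreign> angst.</p>"
)

# Two hypothetical "style sheets": which HTML tag renders each kind of content.
ITALIC_STYLE = {"title": "i", "emph": "i", "foreign": "i"}
MIXED_STYLE = {"title": "i", "emph": "b", "foreign": "u"}


def render(element: ET.Element, style: dict) -> str:
    """Render an element's text and children as HTML according to `style`."""
    parts = [escape(element.text or "")]
    for child in element:
        html_tag = style.get(child.tag)
        inner = render(child, style)
        # Tags missing from the style table are rendered as plain text.
        parts.append(f"<{html_tag}>{inner}</{html_tag}>" if html_tag else inner)
        parts.append(escape(child.tail or ""))
    return "".join(parts)


root = ET.fromstring(SENTENCE)
print(render(root, ITALIC_STYLE))  # everything in italics, as in the HTML version
print(render(root, MIXED_STYLE))   # titles italic, emphasis bold, foreign underlined
```

Switching from one rendition to the other touches the style table only, never the sentence itself.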

You can also use this for more precise searching. You can select all occurrences of Shakespeare as author, or distinguish between a search for Coleridge's Xanadu and the song by Olivia Newton-John. Check out the links above to learn more.
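
The difference between a plain word search and a tag-aware search can be sketched as follows, again in Python with only the standard library. The records, the element names (<work>, <author>, <title>, <place>), and the two helper functions are illustrative assumptions, not a published standard or the pilot's own markup.

```python
import xml.etree.ElementTree as ET

# A hypothetical set of records, tagged so that each piece of text says what it is.
RECORDS = """
<notes>
  <work><author>Shakespeare</author><title>Hamlet</title></work>
  <work><author>Coleridge</author><title>Kubla Khan</title><place>Xanadu</place></work>
  <work><author>Olivia Newton-John</author><title>Xanadu</title></work>
  <work><author>Harold Bloom</author><title>Shakespeare: The Invention of the Human</title></work>
</notes>
"""


def mentioning(root: ET.Element, word: str) -> list:
    """Plain search: titles of records where `word` appears anywhere."""
    return [w.findtext("title") for w in root.iter("work")
            if word in "".join(w.itertext())]


def tagged_as(root: ET.Element, tag: str, value: str) -> list:
    """Tag-aware search: titles of records whose <tag> is exactly `value`."""
    return [w.findtext("title") for w in root.iter("work")
            if w.findtext(tag) == value]


root = ET.fromstring(RECORDS)
print(mentioning(root, "Shakespeare"))           # finds the book about him as well
print(tagged_as(root, "author", "Shakespeare"))  # only Shakespeare as author
print(tagged_as(root, "place", "Xanadu"))        # Coleridge's Xanadu
print(tagged_as(root, "title", "Xanadu"))        # the song
```

The plain search cannot tell an author from a subject, or a place from a song title; the tagged search can, because that information was recorded when the text was marked up.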