Handout for 1999 MLA Paper:
"Digital Alchemy: Research, Writing, and Rendition-
Electronic Theses and Dissertations in XML"
John Robert Gardner
ATLA-CERTR
Emory University
All the links you need to know to get started are included below. A discussion of flat data, XML, and archiving follows, with a short example of XML tagging to conclude this handout.
- Complete two-day tutorial as part of the Networked Digital Library of Theses and Dissertations' (NDLTD) national workshop. The technology is easily grasped in these sessions, and, while the focus is on ETDs, the software and scripts which you will receive are easily adaptable. This workshop takes a dissertation from its current state of writing in a word processor through conversion to XML, writing in XML, submission, conversion to multiple formats, and generation of library MARC and other archiving records. The conference, in turn, augments this with administrative discussions:
http://etd.eng.usf.edu/Conference/tprogram.htm
- When they say 'Oasis', it is.
It's a real one-stop shop. I regularly peruse the "what's new" section:
http://www.oasis-open.org
They also have a specific set of applications of XML (and SGML) in academia which reads as a digital who's who (The Chronicle's InfoTech section would be green . . . . ):
http://www.oasis-open.org/cover/acadapps.html
- XML.com: this page has great reviews of
software (you can add your own) and a fully annotated, interactive section
which explains XML in detail, at
http://www.xml.com
- XML.org is the "new kid on the block," but a well-seasoned one with the pedigree of Oasis; check
http://www.xml.org
- Perhaps the best sorting and arrangement of the latest in XML software, James Tauber's site has links to other great sites too,
http://www.xmlsoftware.com/
- More links and a few resources from the Iowa project:
http://vedavid.org/xml/
XML is Flat Data for ETD Longevity and Access
Flat data is essential to the integrity of library, information, and
archival considerations, and has been canonized as the fundamental
reliable format by key organizations in the information management
world. It meets multiple criteria:
- Institutional memory (e.g., archives, etc.)-- XML meets the U.S.
Government's digital and archival records requirements. Expressed
simply, this is "flat data"--files can be read and the data within them
is accessible without specialized software.
- Favored by the National Archives and Records Administration (NARA)
Electronic Records Work Group, http://www.nara.gov/records/.
Cf. the additional preservation report,
gopher://gopher.nara.gov:70/00/about/cfr/records/1228.txt
- "Preserving Digital Information," the report of the Task
Force on Archiving of Digital Information, commissioned by The
Commission on Preservation and Access, and the Research Libraries Group,
Inc., http://www.rlg.org/ArchTF/tfadi.index.htm
- Flat data is the format chosen for the Oxford English Dictionary and
the e-Text Library, Special
Collections and Rare Books at the University of Virginia, http://etext.lib.virginia.edu
- eXtensible Markup Language is designed for compatibility with existing
SGML resources (many of which are research resources likely to be included
in theses and dissertations anyway). In addition, this compatibility is
ensured for the future (it's even Y2K "compliant") because XML's structural
consistency and integrity (known technically as "well-formedness") make
conversion to other formats simple; see http://www.w3.org/TR/REC-xml and http://www.textuality.com/xml/faq.html
- XML documents can meet international, commercial, and a growing host of
academic disciplinary standards, while still remaining expandable. Thus XML
is ideally suited for dissertation and thesis work, which by its
nature explores new areas. Accordingly, XML allows scholars to
add new identifiers and tags particular to their discipline. Several
disciplinary standards already exist:
- Mathematical Markup Language (MathML), http://www.w3.org/Math/
- Chemical Markup Language (CML), http://sun01.iigb.na.cnr.it/omf/cml-1.0/doc/faq/
- Standards for DNA research, theology, Asian MSS, newswriting, commerce
and business, as well as a multitude of commercial applications, are under
development. See http://www.oasis-open.org/cover/
for the latest news.
Based on these considerations, our pilot builds upon the existing standard of the Text Encoding
Initiative (TEI, Lite version), which is designed for the
particulars of academic needs for preservation of and access to electronic data.
The TEI standard is an implementation of the international standard (of which
HTML, the hypertext markup language, is an application) called Standard Generalized Markup Language
(SGML), ISO 8879. All three systems can be reviewed at the World Wide Web Consortium site for
electronic information resources, http://www.w3.org. For more detailed annotated information on
XML, SGML, and TEI, see the OASIS site, http://www.oasis-open.org,
and the XML.com site, http://www.xml.com. For software resources,
see James Tauber's site, http://www.xmlsoftware.com/.
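As a rough illustration of the "flat data" and "well-formedness" points in this section, here is a minimal sketch in Python using only its standard library. The thesis fragment and the helper name is_well_formed are invented for this handout and are not part of the pilot's software. The sketch shows that an XML file can be searched as ordinary text, and that a parser can tell a structurally sound file from a broken one.

```python
import xml.etree.ElementTree as ET

# Hypothetical thesis fragment, invented for illustration only.
GOOD = (
    "<thesis><title>Persuasione e rettorica</title>"
    "<author>Carlo Michelstaedter</author></thesis>"
)
# The same fragment with the closing </title> tag left out.
BAD = (
    "<thesis><title>Persuasione e rettorica"
    "<author>Carlo Michelstaedter</author></thesis>"
)


def is_well_formed(text: str) -> bool:
    """Return True if the text parses as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# "Flat data": the file is ordinary text, so a plain string search works.
print("Michelstaedter" in GOOD)

# Well-formedness: every opened tag is properly closed and nested.
print(is_well_formed(GOOD))
print(is_well_formed(BAD))
```

Any XML parser could stand in for the one used here; the point is only that the check needs no software tied to a particular vendor or word processor.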
Simple Example of XML vs. other tagging like HTML
In short, look at this sentence:
In Carlo Michelstaedter's Persuasione e rettorica, there is a highly original
treatment of modernity and extremes of fin de siècle angst.
We see three italicized text portions:
- One portion, "Persuasione
e rettorica," uses italics to indicate "title."
- Another portion,
"highly original," uses italics for "emphasis."
- And finally, we have "fin de siècle," where italics
indicate "foreign" language.
Conventional word processing has made such ambiguity "standard." Anyone
who has had to reformat a document from one publisher--or software--to
another knows how frustrating this can be. XML makes the identification
of information much more specific. So, in the example above (which is
done in normal HTML), the phrases have really only been marked (in what
the computer reads) with ambiguous italics:
<i>highly original</i>
This kind of marking assumes every reader knows what italicizing
signifies, and that all computers can read it (which, we all know, is
often not the case!). With XML we
would have a different marking, one in which you tell the computer what each
part is (in much the same way you'd tell it to italicize
something):
- there would be an "emphasis" identifier (you may notice it looks a lot like HTML style):
<emph>highly original</emph>
- there would be a "title" identifier:
<title>Persuasione e rettorica</title>
- there would be a "foreign" identifier:
<foreign>fin de siècle</foreign>
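To see what the computer gains from these identifiers, here is a minimal sketch in Python using its standard library. The <s> wrapper around the sentence is an assumption added for illustration (an XML document needs a single enclosing element); the other three tags are the ones listed above.

```python
import xml.etree.ElementTree as ET

# <s> is a hypothetical wrapper element for the whole sentence.
SENTENCE = (
    "<s>In Carlo Michelstaedter's <title>Persuasione e rettorica</title>, "
    "there is a <emph>highly original</emph> treatment of modernity and "
    "extremes of <foreign>fin de siècle</foreign> angst.</s>"
)

root = ET.fromstring(SENTENCE)

# Each marked phrase now carries its meaning, not just its appearance.
marked = [(child.tag, child.text) for child in root]
for tag, text in marked:
    print(f"{tag}: {text}")

# The plain reading text is still recoverable by dropping the tags.
plain = "".join(root.itertext())
print(plain)
```

Where the HTML version gave the computer three identical sets of italics, this one lets it tell the title, the emphasis, and the foreign phrase apart.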
Aside from the obvious increase in precision, there are other
advantages. For instance, anything you ever write using these tags (or
ones like them for your discipline) can be
reformatted--however many documents you've done this way--simply by
telling one file to "make all titles italic, all foreign underlined," etc.
You never have to re-format the content itself, just tell your computer
what you want it to do with all "title" parts, or "emphasis" parts, etc.
Thus your original content is always "safe" from later re-formatting. You
don't have to risk damaging your composition just to change its
rendition. Plus, XML takes up less disk space than most
word-processing documents, it's Year 2000 safe,
and ANYONE can read XML files on their computer without special software. In other words, in TEOTWAWKI ("the end of the world as we know it") XML is a safe format.
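The idea of telling one file how every "title" or "foreign" part should look can be sketched as follows, again in Python with its standard library. The two style tables, the render function, and the <s> wrapper element are all assumptions made up for this illustration; a real project would more likely use a stylesheet language, but the principle is the same.

```python
import xml.etree.ElementTree as ET
from html import escape

# <s> is a hypothetical wrapper element for the whole sentence.
SENTENCE = (
    "<s>In Carlo Michelstaedter's <title>Persuasione e rettorica</title>, "
    "there is a <emph>highly original</emph> treatment of modernity and "
    "extremes of <foreign>fin de siècle</foreign> angst.</s>"
)

# Two hypothetical "house styles": each maps a meaning to an HTML tag.
STYLE_A = {"title": "i", "emph": "i", "foreign": "i"}
STYLE_B = {"title": "i", "emph": "b", "foreign": "u"}


def render(xml_text: str, style: dict) -> str:
    """Render one marked-up sentence as HTML using the given style table."""
    root = ET.fromstring(xml_text)
    parts = [escape(root.text or "", quote=False)]
    for child in root:
        html_tag = style[child.tag]
        inner = escape(child.text or "", quote=False)
        parts.append(f"<{html_tag}>{inner}</{html_tag}>")
        # Text that follows the element, up to the next tag.
        parts.append(escape(child.tail or "", quote=False))
    return "".join(parts)


print(render(SENTENCE, STYLE_A))
print(render(SENTENCE, STYLE_B))
```

Switching from one rendition to the other changes only the style table; the marked-up sentence itself is never edited.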
You can also use this for more precise searching. You can choose all
occurrences of Shakespeare as author, or distinguish between a search for
Coleridge's Xanadu and the song by Olivia Newton-John. Check out the
links above to learn more.
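A minimal sketch of such tag-aware searching, in Python with its standard library, might look like this. The small catalogue, its element names (work, place, performer), and the essay by "A. N. Other" are hypothetical and exist only to make the contrast visible.

```python
import xml.etree.ElementTree as ET

# Hypothetical mini-catalogue; element names are assumptions for illustration.
CATALOGUE = """
<works>
  <work type="poem">
    <title>Kubla Khan</title>
    <author>Samuel Taylor Coleridge</author>
    <place>Xanadu</place>
  </work>
  <work type="song">
    <title>Xanadu</title>
    <performer>Olivia Newton-John</performer>
  </work>
  <work type="play">
    <title>Hamlet</title>
    <author>William Shakespeare</author>
  </work>
  <work type="essay">
    <title>On Reading Shakespeare</title>
    <author>A. N. Other</author>
  </work>
</works>
"""

root = ET.fromstring(CATALOGUE)
works = root.findall("work")

# Plain-text search: every work that mentions the word anywhere.
text_hits = [
    w.findtext("title") for w in works
    if "Shakespeare" in "".join(w.itertext())
]

# Tag-aware search: only works where Shakespeare is the author.
author_hits = [
    w.findtext("title") for w in works
    if "Shakespeare" in (w.findtext("author") or "")
]

# The same word, distinguished by the tag it appears in.
xanadu_as_place = [w.findtext("title") for w in root.findall("work[place='Xanadu']")]
xanadu_as_title = [w.get("type") for w in root.findall("work[title='Xanadu']")]

print(text_hits)
print(author_hits)
print(xanadu_as_place)
print(xanadu_as_title)
```

The plain-text search returns the essay about Shakespeare along with the play, while the tag-aware search returns only the play; likewise the two Xanadu queries separate the poem's place name from the song's title.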
(© Copyright 1999, John Robert Gardner, All Rights
Reserved.)