Recently, I was a consultant at the University of Iowa charged with the task of designing and implementing a strategy for Electronic Theses and Dissertations, or ETD's. That process culminated in the successful deposit in May, 1998 of the first two Extensible Markup Language (XML) theses, composed and coded by the graduates. My observations and experiences with markup technology during that time form the primary basis for this paper.
If you know anything about how HTML or SGML works, you can learn XML with very little trouble. Even better, one of its design principles was that it be simple for software to work with, and many of the best programs are free or cost less than $100. Beyond this, I will not be trying to sell you on XML. It is already being woven into products from WordPerfect, Microsoft, Adobe, and almost every other major software vendor. I will, however, endeavor to apprise you of the issues and opportunities with XML for research in the humanities, specifically with regard to Electronic Theses and Dissertations (ETD's).
This paper is divided into five main sections. The first section concerns the general context of digital research technology in the humanities, with regard to markup technology. The second is a general overview of issues related to Electronic Theses and Dissertations (ETD's). By way of suggesting a potential solution for many of these issues, the third section introduces Extensible Markup Language, or XML. In the fourth section, the role of XML in the University of Iowa's 1997-1998 ETD pilot is reviewed. This leads, in turn, to the final section offering reflections upon the lessons learned with XML and ETD's at Iowa.
I. Digital Alchemy and the Academy
Research and technology in the humanities is still undervalued in the minds of many tenure and peer review committees.
By "technology in the humanities," I am referring to mark-up based technology--whether that be SGML ("Standard Generalized Markup Language"), HTML or the new ML on the block, XML. Mark-up based technology includes the range of tagging styles, software, and related technologies which enable navigation, manipulation, and study of text resources in ways not afforded, or not conveniently possible, with conventional analogue (or paper) formats.
SGML is an international standard (ISO 8879) which establishes rules for writing the markup code. HTML is one set of rules written according to SGML. In effect HTML is SGML. So is the set of tags in the Text Encoding Initiative, or TEI. So is Mathematical Markup Language, so is Chemical Markup Language, so is DocBook, and, so is XML.
As a term, "Markup Technology" is somewhat under-representative of the range of skills required for a scholar in the humanities to wade into this field. The difficulties in undertaking this kind of "digital alchemy" are compounded by the fact that most of the text research technology developed along lines primarily influenced by mathematical abstraction as with most of computer programming (cf. Russell and Norvig, Artificial Intelligence, A Modern Approach, Prentice Hall, 1998, pp. 5f., 21ff.). This presents a set of demanding learning curves for both the faculty and the aspiring graduate student which, I am suggesting, has the promise of being better negotiated with the advent of XML.
One procedural way in which this learning curve and its demands on developmental time can be managed is by presenting markup technology to graduates early in their programs. In this way, their research assistantships can offer an opportunity for faculty to begin to know the techniques. This also provides a horizontal, rather than top-down, dissemination model for technology, one more compatible with traditional peer-assisted learning models in the academy. At the same time, the graduates will begin early in their careers to build their research as an evolving digital resource tailored to their disciplinary interests. Markup technology will also keep their life's work both accessible and safe from unmanageable size and incompatible file formats. As a corollary, it should be noted that the electronic commerce sector is currently adopting XML wholesale because it makes storage, conversion, and use much simpler.
II. Electronic Theses and Dissertations: Preparation, Presentation, and Preservation
The development of the word processor has arguably been one of the greatest impediments to cogent narrative expression since MTV. The subversion of composition to the facade of formatted presentation is at the heart of the writing and rendition issue. This also causes problems for the integrity of file formats for saving and using between different machines. These problems are not limited to academic writing. This is an industry-wide problem which suddenly came of age with the internet and the frequent need to swap drafts of papers and take advantage of internet-enabled long-distance collaborations. For ETD's, this is a matter which raises complications for Graduate thesis checkers, library archiving, committee review of drafts, and departmental requirements.
The mechanisms of paper document writing and rendition (i.e., presentation and formatting) have a long history which directly affects the ETD process. This history has dominated the last 30 years of the computer industry's development of word processing technology. Exchange of and access to the digital document has remained secondary to a range of gizmos and submenus whose primary functions are printing 1.5" margins, multiple fonts, running headers, tables, graphs, bibliographic data, and document indexes. If an ETD is anything more than simply electronic paper, substantial issues are raised spanning the gamut of the thesis process from submission and review of drafts to the Graduate Examiner's office to library archiving. In the final section of this paper, I will return to the following few representative questions in consideration of ETD's:
To attempt an answer to some of these questions, enter XML.
III. Extensible Markup Language (XML)
First of all, I cannot offer you comfort with the prognosis that XML will remain in a quiet, specialized corner of the academy as has SGML. Unlike SGML, XML has been embraced by the dominant forces and players of electronic commerce. XML is already part of how Amazon.com works, it lies behind the current explosion of data-savvy hand-held wireless devices, it's already resident in WordPerfect Office 2000 (a great tool to start with, by the way), Microsoft is building it into Office 2000, and Adobe is adding it. XML is not developing. It's here. We still have time to shape its inevitable ubiquity in the academy, however.
ETD's with XML markup technology are one avenue by which to accomplish this. Beginning in the graduate programs, early training in markup technology can facilitate the spread of the technology from a sector of the academy that is not yet burdened by the demands of publication, tenure, and committee duties. Graduates, in turn, have interaction with both faculty and undergraduates, so they afford a valuable point of contact for two major components of the academy. The technology itself is quite simple - hence its rapid adoption.
If you are familiar with HTML or SGML, XML is not that different (see the last section of your handout called "Simple Example of XML vs. other tagging like HTML"). Angle brackets, or less-than/greater-than symbols, are used to indicate an element of a document which can be described by a general name, such as one to designate an author. These brackets and the information they contain are usually called tags. They generally appear at the beginning and immediately following whatever it is that they are marking. The tags are usually only visible to the computer or processing software. If it sounds simple, it really is. This basic format has remained unchanged for over 15 years and shows no indications of doing otherwise. There are complications to this simple model nonetheless.
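As a minimal sketch of the tagging just described, the following Python fragment parses a small marked-up passage with the standard library and reads back what each tag marks. The element names (thesis, author, title) and the sample text are invented here for illustration; they are not taken from the handout or from any particular DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element names and content are illustrative assumptions.
# Each element opens with <name> and closes with </name> around what it marks.
SAMPLE = (
    "<thesis>"
    "<author>Jane Doe</author>"
    "<title>A Study of Verse Forms</title>"
    "</thesis>"
)


def describe(xml_text):
    """Return (tag name, marked text) pairs for each child of the root element."""
    root = ET.fromstring(xml_text)
    return [(child.tag, child.text) for child in root]


if __name__ == "__main__":
    for tag, text in describe(SAMPLE):
        print(f"{tag}: {text}")
```

The point of the sketch is only that the tag name travels with the text it marks, so software can ask for "the author" rather than guessing from position or typography.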
Suppose you see italics in a sentence. Is it a foreign term? Is it the title of a book? Is it put there for emphasis? If you are reading it in a browser as HTML, the markup only describes what the text looks like, such as italics. With very few exceptions, HTML is a display language only. Ironically, the medium of print is equally ambiguous in this way.
Certainly TEI will distinguish between these different uses of italics. Depending on the reason for the italics, you can tag it as foreign, as a title, or as emphasis (cf. handout example). Among other things, this makes researching more productive. For instance, it weeds songs by Olivia Newton-John out of a search for "Xanadu." Of course this is a rather pedestrian example and TEI/SGML allows you to do a lot more.
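The "Xanadu" search can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the handout's example: the fragment is TEI-like rather than valid TEI, and the element names and "type" values (title/song, title/poem, name/place) are chosen here only to show the idea of filtering by what the markup says the text is.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-like fragment; tag names and "type" values are assumptions.
SAMPLE = """<text>
  <p>She hummed <title type="song">Xanadu</title> on the way to class.</p>
  <p>In <title type="poem">Kubla Khan</title>, <name type="place">Xanadu</name>
     is the site of the pleasure-dome.</p>
  <p>The word <emph>Xanadu</emph> became shorthand for excess.</p>
</text>"""


def find_term(xml_text, term, skip=(("title", "song"),)):
    """Return (tag, type) for each element whose own text contains term,
    leaving out any element whose (tag, type) pair is listed in skip."""
    root = ET.fromstring(xml_text)
    hits = []
    for elem in root.iter():
        if term not in (elem.text or ""):
            continue
        kind = (elem.tag, elem.get("type"))
        if kind in skip:
            continue
        hits.append(kind)
    return hits


if __name__ == "__main__":
    print(find_term(SAMPLE, "Xanadu"))           # song titles filtered out
    print(find_term(SAMPLE, "Xanadu", skip=()))  # every marked occurrence
```

In a browser all three occurrences might simply appear in italics; it is the semantic tag, not the appearance, that lets the search tell them apart.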
So, if everyone would just use it, why isn't TEI enough anyway? Why mess with still more tag sets such as those allowed by XML?
With XML, you control display separately from the actual composition and semantic identification of the content of your document. XML streamlines the numerous variables of syntax which complicated the use of SGML. As a result, documents in XML can be easily converted to other formats, stored, transferred, or re-structured for study and editing. It does all this while remaining as simple as HTML, yet far more representative of content. XML is adaptable and more readily convertible to other forms, while still retaining the best features of "academic SGML" such as TEI.
TEI is an all-purpose academic form of SGML. TEI describes document structure in addition to some semantic content. But, as the age-old adage ought to run, one size does *not* fit all.
Take for example Iowa's Ed Folsom and his Walt Whitman digital archive project with Kenneth Price at the University of Virginia (http://jefferson.village.virginia.edu/whitman/). I was talking with Ed at Iowa's Obermann Center for Advanced Studies--where most of this paper was conceived--and he told of their use of TEI for tagging all the many kinds of Whitman resources they have. Among countless other things, Walt Whitman wrote memoirs about the Civil War. He wrote these memoirs expressly drawing upon journals from the Civil War. TEI has a journal entry tag. It does not have a memoir tag.
Immediately we can see a methodological problem, as well as a challenge to basic literary critical theory. To add markup to a text is to rewrite the text. Markup glosses, it is often commentary, and it adds interpretation. These are serious issues for the notion of rendition integrity for primary and even secondary research sources.
If this discussion does not sufficiently persuade you of the utility of XML technology to begin participating in its development, then perhaps a sense of how few individuals are doing the work of creating the semantics of text markup, and how overworked they are, will. It is essential to ensure an adequate diversity of voices (see your handout for links to relevant points of dialogue on the internet email lists, or you can roll your own).
The Whitman project example is illustrative of why XML offers an improvement over strictly-limited tagging systems. It is the word "Extensible" in XML's name that holds the key. For the Whitman project they simply convinced the TEI expert--after some discussion--to add a memoir tag. This can be done with technologies like subdoc, or Architectural Forms, but then you're increasing the learning curve. Ultimately, XML schemas could well solve all this.
SGML, HTML and other flavors of markup always require a specifically prescribed tag set and document tagging structure called a Document Type Definition or DTD. XML also has DTD's, but it can work effectively without them, or the documents themselves can include what amounts to an addendum to the DTD. You can add a declaration of additional elements or new conceptual categories specific to a given academic study in the header of the document.
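To make the idea of declaring additional elements in the document header concrete, here is a small sketch, assuming a made-up document type. The DOCTYPE at the top carries an internal subset (the part between the square brackets) that declares a memoir element next to a journal element. The element names and sample sentences are hypothetical, and Python's standard-library parser is non-validating: it reads the declarations but does not enforce them.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the internal subset between [ and ] declares the
# elements this particular study needs, including <memoir>.
DOC = """<!DOCTYPE entry [
  <!ELEMENT entry (journal | memoir)+>
  <!ELEMENT journal (#PCDATA)>
  <!ELEMENT memoir (#PCDATA)>
]>
<entry>
  <journal>Notes made in the hospital wards.</journal>
  <memoir>Recollections drawn from those notes.</memoir>
</entry>"""


def element_names(xml_text):
    """Return the tag names of the root element's children, in document order."""
    root = ET.fromstring(xml_text)
    return [child.tag for child in root]


if __name__ == "__main__":
    print(element_names(DOC))
```

A validating parser would additionally check the document body against those declarations; the sketch shows only that the declarations can travel inside the document itself.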
For Electronic Theses and Dissertations, XML is amenable to their very underlying principle: new research, innovative research, identification of, research into, and oral defense of something new (ideally, that is). How can any set of tags for categorizing information which were developed ten years ago adequately account for the preparation and expression of new research?
This is the primary semantic contribution of XML. XML is all about what the text is all about. In fact, it has been argued that XML is nothing about presentation and everything about content. Writing and rendition are effectively separated to better facilitate focus on content in composition. With XML, we are closer to the environment of composition wherein basic ideas are composed on a blank sheet and then annotated in the revision process, with a final rewrite for presentational style, or rendition. With XML, thought must first go into the cogent structure of nested and contextualized rhetorical expressions instead of into which font style for a sub-heading level, or under which menu the conversion of footnotes to endnotes is hiding.
More important for the notion of academic freedom and non-commercial non-proprietary control of scholarly expression is the fact that XML is an open standard. Much of the very best software that writes, reads, and tests it is free. Its archival integrity as flat data--no special symbols or characters and direct readability--has been attested as such by the U.S. Government National Archives and Records Administration (cf. study referenced in your handout under "XML is Flat Data"). You can open an XML file on a computer that is 10 years old and still read it because XML is written with no special or corporation-controlled characters.
The style and appearance of an XML document is determined by a separate style sheet. For various kinds of documents, such as different mainstream journals, a standard document style sheet is used to simplify the process and allow the author to concentrate on clarity of expression instead of margins and running headers for camera-ready copy. The same can certainly be done for the fairly generic style of dissertations.
IV. XML, ETD's and the University of Iowa
The preceding sections reflect the core of the issues we encountered in the ETD project at the University of Iowa. As we've found subsequently at other schools around the country and around the world, the issues which were the most challenging had to do with the balance between archiving and the rendition, or appearance, of the ETD. The research, content, writing, faculty review, and potential benefit of technology to the university's graduate education program proved secondary to what the dissertation looked like.
What we did at Iowa was to take a group of seven volunteers from different disciplines and begin teaching them how to convert their existing work from word processor formats to a more generic Rich Text Format (RTF). Most word processors do this under the "Save As" menu. From there we had them convert the RTF into a generic XML. The students then continued working from this format to edit and write in XML according to the college's approved Document Type Definition. They were required to purchase no software. Many of them had never worked even with HTML.
Many of the students wanted to write in their familiar word processor. One student, who is currently completing her dissertation on poetry and is writing entirely in XML on a Macintosh, had questions about format for the various verse styles. XML enabled us to finagle a balance in cost, archivability, and format.
In essence, we used a TEI-type of tag set for the basic document structure, with additions for the Iowa-specific parts of the front matter such as the certificate of approval. We wrote Cascading Style Sheets (CSS) which then allowed the documents to display in web browsers and print out very close to traditional paper format styles. In fact, with most browsers you can use the print-to-file feature and choose a PDF format, so you can easily get Adobe PDF from XML if you want it.
For older browsers, we wove in traditional HTML tags so that professors would not have to download and install new software to view the documents. A simple modification to a configuration file on the university server enabled browsers to read the XML files without converting them. Certainly more elaborate schemes are possible, but this model was built with an eye to the best balance between recognized standards, archivability, teachability, simplicity, and extremely low cost (cash outlay for software and hardware remained less than $5,000). All related materials are available online.
In recent months, more tools, such as the Extensible Stylesheet Language for transforming document formats (XSLT), have become standards. This enables a single ETD in one form--like TEI--to be easily changed to HTML or other forms with the use of centralized, automated conversion stylesheets. One stylesheet can convert any number of gigabytes of dissertation data to accommodate any later format, or newly-developed tag sets for academia. Winding up with a dead-end, or cul-de-sac, technology is not a risk with the suite of XML-related technologies.
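The kind of centralized conversion described here can be suggested with a short sketch. Note the substitution: this is not an XSL stylesheet. Python's standard library has no XSLT processor, so a plain Python function stands in for one, and the mapping from TEI-like element names to HTML element names is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from TEI-like element names to HTML element names.
# One table like this, kept in one place, plays the role of the stylesheet.
TAG_MAP = {"text": "html", "body": "body", "head": "h1", "p": "p", "emph": "em"}

SAMPLE = (
    "<text><body><head>Chapter One</head>"
    "<p>An <emph>open</emph> format.</p></body></text>"
)


def to_html(xml_text, tag_map=TAG_MAP):
    """Rename every element according to tag_map; unknown names become <span>."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        elem.tag = tag_map.get(elem.tag, "span")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_html(SAMPLE))
```

Because the rules live in one table rather than in each document, the same function could be run unchanged over every dissertation in an archive, which is the property the paragraph above attributes to conversion stylesheets.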
What is important to underscore here for the development of ETD's is that the primary problems lie not in the technology when XML is chosen. The crucial developmental ingredient, even more than budget, lies in the need for the Graduate College, the thesis examiner, and the library to be completely in synch with one another. Virginia Tech and the University of Iowa have proven this well. Accordingly, Virginia Tech accepts only ETDs--not paper--and the University of Iowa is now working with Michigan to develop XML ETD's as the standard for the CIC association of Big Ten Schools.
V. Closing Observations: Questions revisited
In what little time remains, I want to revisit the questions raised above based upon my experience with XML while at Iowa.
The Cascading Stylesheets (CSS) enabled printouts of XML from browsers so students submitted both digital and paper drafts to their committees. Often the professor could read the digital, and make notes for revisions on the hardcopy.
For archiving, the Graduate College position was "no." Ultimately most logistics related to this issue were deferred to the departments. When students opted to participate, they had to secure approval from all thesis committee members, stipulating that the committee's approval attested that the canonical version was the digital one. In disciplines wherein multi-media content is to be expected, faculty are less reluctant to view digital materials.
"Validation" or checking the document against the Graduate College's Document Type Definition, or DTD, was a basic requirement, as was a check that all digital links functioned. If links were made to sites elsewhere on the web, it was up to the departmental committee to rule as to the ability of the ETD to stand rhetorically should those links change or the sites close down. A visual scan of the document was also made to be sure it displayed and the media materials functioned.
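A rough sketch of such pre-submission checks follows. It is deliberately weaker than the validation described above: the standard-library parser is non-validating, so the function only confirms that the file is well-formed, that a few required sections are present, and which links point outside the document and would need review. The required element names (front, body) and the use of an href attribute for links are assumptions for illustration, not the Graduate College's DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical required sections; a real check would validate against the DTD.
REQUIRED = ("front", "body")


def check_document(xml_text, required=REQUIRED):
    """Return (problems, external_links) for one document.

    problems lists well-formedness or missing-section messages;
    external_links lists href values that point to other web sites.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return [f"not well-formed: {err}"], []
    problems = [f"missing <{name}>" for name in required if root.find(name) is None]
    links = [
        elem.get("href")
        for elem in root.iter()
        if elem.get("href", "").startswith(("http://", "https://"))
    ]
    return problems, links


if __name__ == "__main__":
    sample = (
        "<thesis><front/><body><p>See <ref href='http://example.org/a'>this"
        "</ref> and <ref href='#ch2'>chapter 2</ref>.</p></body></thesis>"
    )
    print(check_document(sample))
```

The list of external links is only collected, not fetched; whether the thesis can stand without them remains the committee's judgment, as described above.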
Flat data. Markup technology does not use proprietary or corporate-owned formats. Its adherence to international standards also favors future compatibility with subsequent formats. XML is currently the best format for both forward and backward compatibility (cf., again, U.S. National Archives and Records Administration on your handout).
It is hoped that this paper has contributed to the ongoing call for such work. Bolter, Landow, Aarseth and others have begun, but the specific applications to academic discourse remain largely unexplored for both methodological and epistemological aspects of this question. In a longer version, this paper explores digital technology in the paradigm of rhetorical inquiry as an effort to articulate a framework for such analysis.
An example of this would be a forthcoming electronic dissertation at Iowa, on which Bolter and others are committee members, now being prepared by Sarah Coggins. I would like to close by noting that this is perhaps the ripest, most exciting field of uncharted rhetorical inquiry and modern language analysis to arise since the computer's arrival this century. Fittingly, not until XML and its rich arsenal of linking, marking, networking, and storage advantages could such inquiry be effectively undertaken. The possibilities are ... well ... quite extensive.