Jonathan Warren studying online social movements from their artifacts

S603 XML Workshop

Dates: Summer 2, June 22-27, 2009
Time: 1:00-3:45pm
Place: LI402 (4th floor, undergraduate stacks), Wells Library, Indiana University, Bloomington
Instructor: Jonathan Warren
  • Office: LI028, at the end of the SLIS faculty hallway, Wells Library
  • Office hours: 4:00-5:00pm, Wednesday and Friday, both the week of class and the following week (schedule an appointment)

Objectives

At the end of this workshop, you should:

  • Understand the structure of XML, and its strengths and weaknesses as a tool for structuring data and information;
  • Be able to create well-formed and valid documents using XML tags, DTDs, Schema, and XSLT;
  • Understand how to use XML to create an open markup language; and
  • Have published at least three linked documents that can be viewed on the web using XML, XSLT, and CSS.

Introduction

First there was Standard Generalized Markup Language (SGML). This markup language has been used in the publishing industry for many years and became an International Organization for Standardization (ISO) standard (ISO 8879) in 1986. SGML is a tremendously complex language that provides great flexibility for those who can use it to prepare structured text documents. The specification is over 500 pages long and, according to the Cover Pages [1]:

Conceived notionally in the 1960s - 1970s, the Standard Generalized Markup Language (SGML, ISO 8879:1986) gave birth to a profile/subset called the Extensible Markup Language (XML), published as a W3C Recommendation in 1998. Depending upon your perspective and requirements, the differences between SGML and XML are inconsequential or immense. SGML is more customizable (thus flexible and more "powerful") at the expense of being (much) more expensive to implement. ... For an overview of differences, see James Clark's document "Comparison of SGML and XML"; for other treatments, see references in XML and/versus SGML. As of 2002-07, relatively few enterprise-level projects are started as SGML applications, but many SGML applications implemented before 1999 are still running productively. In some cases, peculiar business requirements favor the use of SGML for certain features that have been eliminated in XML.

SGML is a cross-platform language that is used to structure information in ways that allow easy exchange of documents without need for proprietary hardware or software. This is important because, according to Charles Goldfarb [2], one of its creators,

SGML is designed to make your information last longer than the systems that created it. Such longevity also implies immunity to short-term changes -- such as a change from one application program to another -- so SGML is also inherently designed for re-purposing and portability...

The SGML standard defines the requirements for "conforming SGML documents." These requirements are remarkably flexible. In fact, SGML isn't so much a standard for "what you have to do" as a standard for "describing what you've done and why you chose to do it".

HTML is an application of SGML. It is defined by a fairly rigid SGML document type definition (DTD) that greatly simplifies the language. As a consequence, it has become a universally understood publishing language which all computers on the web can potentially understand. In spite of its ease of use as a tool for web-based information design, however, HTML has its limitations, in no small part because it is a language designed to describe a document's structure and not its appearance. In addition, the designer has little control over how a given page is displayed on a person's machine.

One solution to these problems has been the introduction of cascading style sheets (CSS) by the World Wide Web Consortium. CSS separates structure from presentation and provides designers with the ability to control elements of HTML markup on many content pages from an external template. CSS is rule-based and uses a syntax to specify how a particular HTML element (affecting text, font selection, images, spacing, white space, color, etc.) will appear; CSS markup can be applied to groups of HTML elements, which can be defined in non-standard terms, nested HTML elements, and even discrete blocks of text.

However, CSS style sheets still work within the DTD for HTML and, although they extend the control of the designer, they still have the same limitations as does standard HTML.

XHTML is the latest version of HTML. It is a "reformulation of HTML 4 in XML 1.0" that became a recommendation in 2000 [3]. This means that XHTML is written as an application of XML and therefore follows all of the rules of XML. It is a cleaner and less forgiving version of HTML and, as such, will be compatible with XML applications.
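To make "less forgiving" concrete, here is a minimal sketch that uses Python's standard-library XML parser to check two fragments for well-formedness. The helper name is_well_formed and the sample markup are illustrative assumptions, not part of the workshop materials, and the check covers well-formedness only, not validity against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# HTML-style habits: an unquoted attribute value and an unclosed <br>.
html_style = "<p class=note>First line<br>Second line</p>"

# XHTML-style: attribute values quoted and every element closed.
xhtml_style = '<p class="note">First line<br />Second line</p>'

print(is_well_formed(html_style))   # False
print(is_well_formed(xhtml_style))  # True
```

Browsers of the time would typically render both fragments, but only the second obeys the XML rules that XHTML inherits.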

What is XML?

Extensible Markup Language (XML) is a subset of SGML. It has been designed to incorporate only those elements of SGML that are needed to prepare and deliver documents across the web or other communications infrastructure, such as an intranet. It is a language that is used to describe documents, not render them. XML is very powerful because it is a metalanguage which allows users to define their own tags and attributes that can be easily processed and displayed across platforms. An XML document is "self-describing", meaning that it contains all of the rules and tags necessary for it to be displayed. XML extends the power of markup beyond HTML because it incorporates new ways of handling styles (Extensible StyleSheet Language, XSL), links (Extensible Linking Language, XLL), and even querying (XQuery & XPath).
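As a rough illustration of defining your own tags and attributes and then querying them, the sketch below parses a small made-up vocabulary with Python's standard library. The element names (library, book, title, author), the attributes (id, year), and the sample records are all hypothetical; ElementTree also supports only a limited subset of XPath.

```python
import xml.etree.ElementTree as ET

# A made-up vocabulary: the tags and attributes below are invented for
# illustration and do not come from any standard or from the workshop.
document = """<?xml version="1.0"?>
<library>
  <book id="b1" year="1986">
    <title>Example SGML Handbook</title>
    <author>A. Author</author>
  </book>
  <book id="b2" year="1998">
    <title>Example XML Primer</title>
    <author>B. Writer</author>
  </book>
</library>"""

root = ET.fromstring(document)

# Path expression: every <title> that is a child of a <book> under the root.
titles = [title.text for title in root.findall("./book/title")]

# Attribute predicate: the first <book> whose year attribute is "1998".
book_1998 = root.find("./book[@year='1998']")

print(titles)
print(book_1998.get("id"), book_1998.findtext("title"))
```

Nothing in the parser knows what a "book" is; the meaning of the tags comes entirely from the people and programs that agree to use them.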

XML was developed by the SGML Editorial Review Board (later known as the XML Working Group), formed under the auspices of the W3C in 1996. According to the W3C [4], the design goals for XML are that:

  • It shall be straightforwardly usable over the Internet.
  • It shall support a wide variety of applications.
  • It shall be compatible with SGML.
  • It shall be easy to write programs which process XML documents.
  • The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
  • XML documents should be human-legible and reasonably clear.
  • The XML design should be prepared quickly.
  • The design of XML shall be formal and concise.
  • XML documents shall be easy to create.
  • Terseness in XML markup is of minimal importance.

This 1.5-credit workshop will provide you with an intensive, hands-on introduction to the use of XML to mark up and publish documents on the WWW (or on your internal intranet). You should also gain a conceptual understanding of the structure, strengths, and weaknesses of XML, which will allow you to use this language effectively and efficiently.

XML is beginning to have an impact across the information professions. This workshop gives you the opportunity to see what the buzz is all about.

This workshop is also excellent preparation for Dr. Walsh's or Dr. Ding's courses.

Prerequisites & degree requirements

SLIS students must have completed L571/S532 Information Architecture for the Web.

Students outside of SLIS and people outside of the University must demonstrate to the instructor an adequate knowledge of XHTML Strict and CSS.

Note: This introductory workshop does not satisfy any of the programming requirements of the MIS degree, as only one week is spent on a scripting language (XSLT). Several courses by Drs. Yang and Walsh satisfy the programming requirement.

No required texts

There are no required texts for this course. Extensive lecture notes will be provided online. Students wanting more than the lecture notes are encouraged to browse their favorite bookstore for one or several good books covering XML, DTDs, Schema, XLL/XPointer, XPath, and XSLT at an introductory level. The following book is officially recommended, but please look for one that "speaks to you" (after all, they all cover something which is supposed to be standardized):

Castro, E. XML for the World Wide Web. Berkeley, CA: Peachpit Press.  It's under $20.00 (barely)...

References

[1] Cover, R. (2006). SGML and XML as (Meta-) Markup Languages. Cover Pages.
http://xml.coverpages.org/sgml.html
[2] Goldfarb, C. (1997). Charles F. Goldfarb's SGML Source: InFrequently Asked Questions (InFAQs).
http://www.sgmlsource.com/infaqs.htm
[3] World Wide Web Consortium. (2002). XHTML™ 1.0 The Extensible HyperText Markup Language (Second Edition).
http://www.w3.org/TR/xhtml1/
[4] World Wide Web Consortium. (2006). Extensible Markup Language (XML) 1.0 (Fourth Edition).
http://www.w3.org/tr/1998/REC-xml-19980210#sec-origin-goals