This information is being maintained for archive/historical purposes only.
It will not be updated.
Please see http://archive.cabinet-office.gov.uk for details.

 

5.4.1 Extensible Markup Language (XML)

5.4.1.1 Introduction to XML

XML is not a language. It’s a syntax! It defines the punctuation necessary to create correct examples of XML documents.

Correct sentences in English start with a capital letter and end with a full stop.

this sentence is incorrect
This sentence is correct.

Correct elements in XML have a start–tag and matching end–tag.

<ThisIs></Incorrect>
<ThisIsCorrect></ThisIsCorrect>

A start–tag has an opening <, a tag name and a closing >. An end–tag has an opening </, a tag name and a closing >. The tag name must be the same for the start–tag and end–tag.

There are other rules but not that many – most of XML’s design principles revolve around keeping the specification as simple as possible. The rules are set out by the W3C (World Wide Web Consortium) in the XML 1.0 Recommendation: www.w3.org/TR/REC-xml

Well–formed XML Documents

An XML declaration specifies the version of XML being used:

<?xml version="1.0"?>

An element has a start–tag and matching end–tag and will be either a container for data, like the element <Name></Name> in the following:

<Name>Lindsey Brown</Name>

or a container for other elements, like the element <Names></Names> in the following:

<Names>
<FirstName>Lindsey</FirstName>
<SecondName>Brown</SecondName>
</Names>

The root element is the element, with its start–tag and matching end–tag, that contains every other element in the document.

Empty elements should use an empty–element tag. Thus, our first example is better expressed as

<ThisIsCorrect/>

Note the /> at the end of the tag which indicates an empty element.

The term properly nested means that if a start–tag is inside an element, the end–tag must be inside the same element. The following example is not well–formed:

<ThisIsNot><WellFormed></ThisIsNot></WellFormed>

To correct this, use either:

<ThisIs><WellFormed></WellFormed></ThisIs>

or:

<ThisIs><WellFormed/></ThisIs>
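
The well–formedness rules above can be checked mechanically. The following is a minimal sketch, assuming Python and the xml.etree.ElementTree parser from its standard library; the helper name is_well_formed is hypothetical and used only for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # The parser rejects mismatched or badly nested tags.
        return False


# The examples used in this section, correct and incorrect.
examples = [
    "<ThisIs></Incorrect>",
    "<ThisIsCorrect></ThisIsCorrect>",
    "<ThisIsCorrect/>",
    "<ThisIsNot><WellFormed></ThisIsNot></WellFormed>",
    "<ThisIs><WellFormed/></ThisIs>",
]
for example in examples:
    print(example, is_well_formed(example))
```

Any conforming XML parser should reject the two incorrect examples in the same way; the choice of Python here is only for convenience.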

Example XML Document

The following is a complete example of a well–formed XML document representing an employee record:

Example 1.1: Employee.xml

<?xml version="1.0" encoding="utf-8"?>
<Employee id="91710">
<Name>Lindsey Brown</Name>
<Address>
<Line>Grenada House</Line>
<Line>150 Beaconsfield Road</Line>
<Town>London</Town>
<Postcode>SW1V 1LQ</Postcode>
</Address>
<Salary>30</Salary>
<JobTitle>Executive Officer</JobTitle>
<JobDescription>Supervise a small team of staff and day to day
management of the Accounts Payable section.</JobDescription>
<Email>lbrown@culture.gov.uk</Email>
<Department>Department for Culture, Media and Sport</Department>
<Phone>020 7421 3423</Phone>
</Employee>

The root element is Employee, which has an attribute id with a value of 91710. Attributes are normally used for simple pieces of additional data about the content within an element – such as a reference number. In XML, attribute values are always quoted. Single quotes or double quotes may be used.
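
As an illustration of how a program might read such a record, the following sketch assumes Python and its standard xml.etree.ElementTree module. The record is a shortened copy of Example 1.1 held in a string, and the variable names are chosen only for this example.

```python
import xml.etree.ElementTree as ET

# A shortened copy of Example 1.1, held in a string for illustration.
EMPLOYEE_XML = """<?xml version="1.0" encoding="utf-8"?>
<Employee id="91710">
<Name>Lindsey Brown</Name>
<Address>
<Line>Grenada House</Line>
<Line>150 Beaconsfield Road</Line>
<Town>London</Town>
<Postcode>SW1V 1LQ</Postcode>
</Address>
<JobTitle>Executive Officer</JobTitle>
</Employee>"""

# Parse from bytes so that the encoding declaration is honoured.
root = ET.fromstring(EMPLOYEE_XML.encode("utf-8"))

root_name = root.tag                 # name of the root element
employee_id = root.get("id")         # attribute values are returned as text
name = root.findtext("Name")         # text held by a child element
address = root.find("Address")       # an element that contains other elements
address_parts = [child.text for child in address]

# Single quotes around an attribute value are equally acceptable.
same_id = ET.fromstring("<Employee id='91710'/>").get("id")

print(root_name, employee_id, name, address_parts, same_id)
```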

Case–sensitivity

XML is case–sensitive which means that JobTitle is not the same as jobtitle. Consider the following:

<JobTitle>Executive Officer</jobtitle>

This is not well–formed because the start–tag and end–tag do not match.

Character Encoding (Unicode)

The addition of encoding="utf-8" in the XML declaration above signifies that the document uses UTF–8, an 8–bit encoding of Unicode. Unicode is an international standard for encoding text in any human language for use in computing. It overcomes the limitations of traditional character sets such as ISO–8859 which largely deal with European languages. As a result, Unicode has found widespread support in the internationalisation of software.

UTF–8 is the default encoding for XML documents and can represent every Unicode character, including ideographic scripts such as Chinese, Japanese and Korean. UTF–8 is the UK Government standard for encoding XML files.

For more information on Unicode, refer to the website www.unicode.org/
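
To make the encoding concrete, the sketch below assumes Python and shows how many bytes UTF–8 uses for some sample text. The sample strings are illustrative only and are written with escape sequences so that the example does not depend on how this page itself is encoded.

```python
import xml.etree.ElementTree as ET

# Illustrative sample text: plain Latin, accented Latin and a Chinese character.
samples = ["Brown", "Zo\u00eb", "\u738b"]

# UTF-8 uses between one and four bytes for each character.
byte_lengths = [len(text.encode("utf-8")) for text in samples]
print(byte_lengths)

# A UTF-8 encoded document containing the Chinese character parses normally.
document = '<?xml version="1.0" encoding="utf-8"?><Name>\u738b</Name>'.encode("utf-8")
name = ET.fromstring(document).text
print(name == "\u738b")
```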

XML Names

In addition to the basic structural rules, XML also restricts the names of elements and attributes. Names can only consist of letters, numbers and the punctuation characters _ (underscore), - (hyphen), . (full stop) and : (colon). In addition, names may only start with a letter or an underscore; the specification also permits a leading colon, but colons are best avoided in ordinary names.

Note: In UK Government, element names should be in upper camel case. Upper camel case names start with an initial capital, then each new word within the name starts with an initial capital. Where an all uppercase abbreviation (such as UK) or a digit is incorporated into a name, the following word should start with a lower case letter.

Examples: <UKaddress> but <UnitedKingdomAddress>
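
The naming rules can be approximated with two regular expressions. This is a rough sketch, assuming Python; it covers only the simplified rules given here and only unaccented letters, whereas the full specification allows many more characters. The function names are hypothetical, and the upper camel case check is deliberately loose because a program cannot tell where one word ends and the next begins.

```python
import re

# Simplified rule from the text, restricted to unaccented letters for brevity.
NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_.:-]*")
# Loose upper camel case check: an initial capital, then letters and digits only.
UPPER_CAMEL_PATTERN = re.compile(r"[A-Z][A-Za-z0-9]*")


def is_simple_xml_name(name):
    """Return True if name follows the simplified naming rule."""
    return NAME_PATTERN.fullmatch(name) is not None


def is_upper_camel_case(name):
    """Return True if name could be an upper camel case name."""
    return UPPER_CAMEL_PATTERN.fullmatch(name) is not None


for candidate in ["JobTitle", "job-title", "_private", "1stLine", "UKaddress"]:
    print(candidate, is_simple_xml_name(candidate), is_upper_camel_case(candidate))
```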

Further Information

This is a considerable simplification of the XML specification but it is enough to get a feel for what XML documents are. For more information, refer to the W3C Recommendation: www.w3.org/TR/REC-xml

5.4.1.2 Using XML

The employee record in the previous section was an example of using XML for the description of data, an example of structured information. XML and its related technologies have many uses aside from data description: data exchange, presentation of information, querying data, information modelling and web services. All of these uses will be covered in this chapter, with a particular focus on Web publishing and the link with UK Government standards such as the e–Government Interoperability Framework (e–GIF).

Data Exchange

XML was primarily designed as a lightweight data exchange – or messaging – format for transmission over the internet. Released in 1998 as a simplified version of SGML (Standard Generalised Markup Language), its popularity has burgeoned due to its simplicity, its similarity to the widespread HTML (Hypertext Markup Language) and its ability to be read easily by both humans and computers. Now you can find XML in almost every operating system, software application and programming language. It has far outstripped its originally intended use and is now used in areas as diverse as news syndication, server configuration files and vector graphics.

Presentation of Information

XML data can be processed by a variety of tools into a form that is presented to the user via an interface. This is known as XML rendering. For example, an XML document can be transformed into a Web page and presented to the user via a browser. This process is covered in more detail in section 5.4.2.1, XSL Transformations (XSLT).

Querying Data

As we have already mentioned, XML may be used to describe data. To retrieve data from an XML document, the W3C developed XPath (XML Path Language). This is a query language that resembles the notation for addressing files in some operating systems. XPath is covered in section 5.4.2.2.
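
To give a flavour of the file–path style of addressing, the sketch below assumes Python and its standard xml.etree.ElementTree module, which supports only a limited subset of XPath. The record is a shortened copy of Example 1.1.

```python
import xml.etree.ElementTree as ET

# A shortened copy of Example 1.1.
EMPLOYEE_XML = """<Employee id="91710">
<Name>Lindsey Brown</Name>
<Address>
<Line>Grenada House</Line>
<Line>150 Beaconsfield Road</Line>
<Town>London</Town>
<Postcode>SW1V 1LQ</Postcode>
</Address>
</Employee>"""

root = ET.fromstring(EMPLOYEE_XML)

# Path steps are separated by "/" much like directories in a file path.
town = root.findtext("Address/Town")
# ".//" searches at any depth below the current element.
postcode = root.findtext(".//Postcode")
# A positional predicate selects the first Line element (XPath counts from 1).
first_line = root.findtext("Address/Line[1]")
# findall returns every matching element, in document order.
all_lines = [line.text for line in root.findall("Address/Line")]

print(town, postcode, first_line, all_lines)
```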

Information Modelling

So far we have talked about structured information but not about ensuring that the content of our XML documents is of the correct form. For example, if we receive an XML document representing an invoice, we might want to ensure that the date consists of a day, month and year and is not just a garbled piece of text. Otherwise we may not be able to enter it onto our finance system. This means validating the XML as well as ensuring that it is well–formed. The following is well–formed but not valid, because the content is not a date:

<InvoiceDate>Ghfjdksl:</InvoiceDate>

However, this is well–formed and valid:

<InvoiceDate>2005-07-05</InvoiceDate>

To enable validation of XML data, we use XML schemas. These are covered in section 5.4.3, Information Modelling.
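
The difference between the two checks can be sketched in code. The example below assumes Python; because its standard library does not include an XML schema validator, a simple date–parsing test stands in for real schema validation. The function name check_invoice_date is hypothetical.

```python
import datetime
import xml.etree.ElementTree as ET


def check_invoice_date(xml_text):
    """Return (well_formed, valid) for a single InvoiceDate element."""
    try:
        element = ET.fromstring(xml_text)
    except ET.ParseError:
        # Not well-formed, so the content cannot be checked at all.
        return (False, False)
    try:
        # Stand-in for schema validation: the text must parse as a
        # year-month-day date.
        datetime.datetime.strptime(element.text or "", "%Y-%m-%d")
        return (True, True)
    except ValueError:
        return (True, False)


print(check_invoice_date("<InvoiceDate>Ghfjdksl:</InvoiceDate>"))
print(check_invoice_date("<InvoiceDate>2005-07-05</InvoiceDate>"))
```

A real schema would describe every element in the invoice, not just the date, and the check would be carried out by a validating parser rather than by hand–written code.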

Web Services

In a very simple scenario, a user requests a Web page via a browser and receives XHTML – HTML that follows the XML syntax. This is an example of XML data exchange over the internet. In a more sophisticated scenario, an organisation may request information, such as a stock quote or the weather, from a third–party provider for publication on its website. This type of business to business (B–to–B) transaction has been made possible by the family of XML–based technologies called Web Services. We will take a look at Web Services in section 5.4.4.1.

5.4.1.3 The Benefits of XML

Separation of Content and Presentation

The key benefit of XML is the fact that one document can be delivered to the user in the way that is most suitable to the delivery channel in question – mobile phone, PDA, digital TV, Web and so on.

Figure 1.1: XML promotes multiple channels of delivery

The concept behind this is that in XML we mark up semantics rather than presentation. We describe the meaning of things as opposed to the way they look. In HTML, tags such as <p>, <h1>, <ul> and <br> help us to format content on a web page. For example, a page representing a film might be authored as:

<html>
<head>
<title>Star Wars (1977)</title>
</head>
<body>
<h1><strong>Star Wars</strong></h1>
<br>
<b>Directed by</b><br>George Lucas<br>
<br>
<b>Written by</b><br>George Lucas<br>
<br>
<b>Genre</b><br>Sci–fi<br>
<br>
<b>Tagline</b><br>A long time ago in a galaxy far, far away... <br>
<b>User Rating</b><br>9/10<br>
<br>
</body>
</html>

In XML, we would have something like the following:

Example 1.2: Film.xml

<?xml version="1.0" encoding="utf-8"?>
<Film>
<Name>Star Wars</Name>
<Director>George Lucas</Director>
<Writer>George Lucas</Writer>
<Genre>SF</Genre>
<Tagline>A long time ago in a galaxy far, far away...</Tagline>
<Rating>9</Rating>
</Film>

The XML has two advantages. First, we can understand it intuitively: the tag names tell us what each item means, so we do not have to learn a new language to read it. Second, we can repurpose this content – via XML rendering (see Presentation of Information above) – into the desired form. This could be a web page but it might equally be a text (SMS) message or a PDF. This is what we mean by separation of content and presentation. In HTML, we do not have this ability because the meaning of the information and the way it is presented to the user are intermingled.
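
Repurposing can be sketched as follows. The example assumes Python and uses plain string formatting rather than XSLT, which is the usual tool for this job; the function names render_html and render_sms and the output layouts are invented for illustration.

```python
import html
import xml.etree.ElementTree as ET

# The content of Example 1.2.
FILM_XML = """<Film>
<Name>Star Wars</Name>
<Director>George Lucas</Director>
<Writer>George Lucas</Writer>
<Genre>SF</Genre>
<Tagline>A long time ago in a galaxy far, far away...</Tagline>
<Rating>9</Rating>
</Film>"""


def render_html(film):
    """Render the film as a small HTML fragment for a web page."""
    name = html.escape(film.findtext("Name"))
    director = html.escape(film.findtext("Director"))
    return "<h1>" + name + "</h1><p>Directed by " + director + "</p>"


def render_sms(film):
    """Render the same film as a short plain-text message."""
    return "{} ({}) {}/10".format(
        film.findtext("Name"), film.findtext("Genre"), film.findtext("Rating")
    )


film = ET.fromstring(FILM_XML)
print(render_html(film))
print(render_sms(film))
```

Both outputs are produced from the same source document, so a change to the film record needs to be made in only one place.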

Open International Standard

XML is an open standard. This means that the specification is free and publicly available. It is also an international standard managed by the World Wide Web Consortium (W3C). It is available in several international written languages (including English) and is based on the international character encoding Unicode – so XML can be written in any language. For these reasons and due to its massive market support, XML has been adopted as the government standard for data integration. Section 1.6 of the e–GIF (e–Government Interoperability Framework) cites as a key policy:

‘adoption of XML as the primary standard for data integration and data management for all public sector systems’

Government information systems require the capacity to share information and services across large disparate networks – which are often proprietary. Institutions such as the NHS, the police and local authorities are characterised by loosely bound infrastructures with many different ICT support services on different platforms. The same is true of central government departments.

To enable shared information and services across these networks, the interfaces are standardised according to open international standards such as XML. This ensures that, for example, a .NET system communicates effectively with a Java application. To retool proprietary systems is not usually cost–effective. However, to standardise the components that talk to each other is a powerful proposition. Consider the following diagram:

Figure 1.2: XML data exchange

Suppose that in five years’ time we decide to migrate System A to a new platform. We do not need to make any changes to System B because the systems communicate using XML. It is this longevity that characterises open international standards and helps us to future–proof our systems.

For a more detailed list of the XML specifications adopted by UK Government, please refer to the Technical Standards Catalogue Table 3 – Specifications for data integration www.govtalk.gov.uk/egif/dataintegration.asp#table3

Data Integrity

In the past, when information was passed between IT systems, file formats such as tab–delimited or comma–separated value (CSV) were used. Like XML, these are text files and can therefore be read by human beings. In CSV format, Employee.xml would appear as:

"Lindsey Brown","Grenada House","150 Beaconsfield Road","London",...

The trouble with this type of file format is that there is no standard way of ensuring that the file is of the correct form (well–formed) and contains the right sort of information (valid). The only way to do this is to write a bespoke program for every data feed we have. This is inefficient and often overlooked. Legacy systems that use CSV, or some other bespoke format, are therefore prone to data corruption.

XML gets round this problem by providing a standard file format such that a reader, or parser, will return an error message if it receives data that is not well–formed. Moreover, a validating parser will also check the data against a schema (see Information Modelling above), which ensures that the information is meaningful as well as correctly formed. This promotes the integrity of our data and leads to more accurate and up to date information.
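
The contrast with CSV can be demonstrated with a feed that has been cut short in transit. This is a minimal sketch, assuming Python and its standard csv and xml.etree.ElementTree modules; the truncated data is invented for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Both feeds are cut off part-way through, simulating a corrupt transfer.
truncated_csv = '"Lindsey Brown","Grenada House"'
truncated_xml = "<Employee><Name>Lindsey Brown</Name><Address><Line>Grenada Ho"

# The CSV reader returns a row without complaint; the missing fields go unnoticed.
rows = list(csv.reader(io.StringIO(truncated_csv)))

# The XML parser reports an error because the elements are never closed.
try:
    ET.fromstring(truncated_xml)
    xml_error = None
except ET.ParseError as error:
    xml_error = str(error)

print(rows)
print(xml_error)
```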

XML parsers are covered in more detail in Appendix A Tools and Processors.

5.4.1.4 XML on the World Wide Web

Extensible Hypertext Markup Language (XHTML)

XML was designed to be compatible with the World Wide Web. One of the earliest XML applications was to formalise HTML into XML–compliant markup – the HTML 4.0 language was tightened up to make it well–formed. This means

In practice this will involve, for example, replacing stray <br> tags with <br/> and closing all <p> elements. The result is XHTML. As well as being well–formed, web pages also need to be valid with respect to one of the three XHTML DTDs (Document Type Definitions): Strict, Transitional or Frameset.

To conform with the Strict XHTML DTD, you need to

This also has the effect of making your web pages more accessible. In particular, Checkpoint 3.2 of the Web Content Accessibility Guidelines 1.0 requires you to:

‘Create documents that validate to published formal grammars. [Priority 2]’

For more information on web accessibility, refer to Section 2.4, Building in universal accessibility.

XHTML 1.0 and the Web Content Accessibility Guidelines 1.0 are both W3C Recommendations (see links at the end of the section).

Viewing XML Documents

XML documents can be viewed through most web browsers (e.g. Internet Explorer, Netscape, Firefox, Opera). This is how Example 1.1 appears in Internet Explorer 6.0:

Figure 1.3: Employee.xml viewed through a browser

Note: XML documents can also be viewed in any text editor.

For your assistance – resources

W3C XML 1.0 Recommendation
www.w3.org/TR/REC-xml

Unicode
www.unicode.org/

Technical Standards Catalogue 6.1 Table 3 – Specifications for data integration
www.govtalk.gov.uk/egif/dataintegration.asp#table3

Web Content Accessibility Guidelines 1.0
www.w3.org/TR/WCAG10/

W3C XHTML 1.0 Recommendation
www.w3.org/TR/xhtml1/

Building in universal accessibility
Section 2.4: Building in universal accessibility
