This information is being maintained for archive/historical purposes only.
It will not be updated.
Please see http://archive.cabinet-office.gov.uk for details.

 


5.4.3 Information Modelling

One of the great benefits of using XML as a data exchange format is the ability to validate content against a defined set of rules and data types. As well as ensuring that data is well-formed (that is, syntactically correct), it is useful to be able to restrict the names of elements and attributes, the order in which they occur and the values they may take.

Document Type Definitions (DTDs)

Initially, the way to ensure the correct structure of an XML document was through a document type definition (DTD). DTDs are written in a non-XML syntax and place restrictions on the allowable content of an XML document. For example, a basic DTD for Film.xml (see Example 1.2, section 5.4.1) would be:

Example 3.1: Film.dtd

<!ELEMENT Film (Name, Director, Writer, Genre, Tagline, Rating)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Director (#PCDATA)>
<!ELEMENT Writer (#PCDATA)>
<!ELEMENT Genre (#PCDATA)>
<!ELEMENT Tagline (#PCDATA)>
<!ELEMENT Rating (#PCDATA)>

This DTD simply declares a Film element and defines the elements it contains and the order they should occur in: Name, Director, Writer and so on. The contained elements are defined as #PCDATA (parsed character data) which means that they may not contain elements themselves, just simple text.

This DTD would be linked from Film.xml via the DOCTYPE (document type) declaration,

<!DOCTYPE Film SYSTEM "Film.dtd">

assuming that Film.xml and Film.dtd are stored in the same directory.
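To show where the declaration sits, here is a minimal sketch of a complete Film.xml that would be valid against Film.dtd. The element values are illustrative placeholders, not the content of Example 1.2.

```xml
<?xml version="1.0"?>
<!-- Film.xml: all element values are hypothetical placeholders -->
<!DOCTYPE Film SYSTEM "Film.dtd">
<Film>
  <!-- Child elements appear in exactly the order the DTD declares -->
  <Name>Example Film</Name>
  <Director>A. Director</Director>
  <Writer>A. Writer</Writer>
  <Genre>Drama</Genre>
  <Tagline>An example tagline.</Tagline>
  <Rating>7.5</Rating>
</Film>
```

The DOCTYPE declaration must come after the XML declaration and before the root element.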

A parser may then be configured to validate the document against its DTD. (Such a parser is known as a validating parser.) This involves checking every element in the XML document against its definition in the DTD to ensure validity. This is distinct from checking the well-formedness of the document which is only concerned with XML syntax.

The problem with DTDs is that they are very limited when it comes to more complex data models. For example, there is no way to ensure that Rating is a positive decimal number less than ten. To do this, we need a richer document definition language such as XML Schema.
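As a preview of what XML Schema adds, the following sketch shows one way the Rating constraint could be expressed. The type name RatingType is a hypothetical choice made for this illustration.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- RatingType is a hypothetical name: a decimal greater than 0 and less than 10 -->
  <xsd:simpleType name="RatingType">
    <xsd:restriction base="xsd:decimal">
      <xsd:minExclusive value="0"/>
      <xsd:maxExclusive value="10"/>
    </xsd:restriction>
  </xsd:simpleType>

  <xsd:element name="Rating" type="RatingType"/>

</xsd:schema>
```

Under this definition a value such as 7.5 would be accepted, while 0, 10 or non-numeric text would be rejected by a validating parser.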

5.4.3.1 Introduction to XML Schema

There are several XML schema languages but the preferred one for use in government systems is the W3C XML Schema language. Section 2.15 of the e-Government Interoperability Framework (e–GIF) states:

‘The W3C’s XML schema recommendation will be the main schema language used for XML-based products and services…’

XML Schema is preferred to the older DTD (document type definition) because it allows much greater control over the permitted content of an XML message and is namespace-aware. Namespaces are covered in section 5.4.3.3. The W3C XML Schema recommendation is published in three parts: Part 0: Primer, Part 1: Structures and Part 2: Datatypes.

Employee Schema

The XML Schema language is too large and complex to cover in any great detail in this chapter but it is useful to look at an example to get an idea of the basic constructs available for modelling information. The following is a possible schema for Employee.xml (Example 1.1):

Example 3.2: Employee.xsd

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" attributeFormDefault="unqualified">

<xsd:element name="Employee">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="Name" type="xsd:string"/>
<xsd:element name="Address" type="AddressType"/>
<xsd:element name="PayBand" type="xsd:string"/>
<xsd:element name="Salary" type="xsd:decimal"/>
<xsd:element name="JobTitle" type="xsd:string"/>
<xsd:element name="JobDescription" type="xsd:string"/>
<xsd:element name="Email" type="xsd:string"/>
<xsd:element name="Department" type="xsd:string"/>
<xsd:element name="Phone" type="xsd:string"/>
</xsd:sequence>
<xsd:attribute name="id" type="xsd:integer"/>
</xsd:complexType>
</xsd:element>

<xsd:complexType name="AddressType">
<xsd:sequence>
<xsd:element name="Line" type="xsd:string" maxOccurs="3"/>
<xsd:element name="Town" type="xsd:string"/>
<xsd:element name="County" type="xsd:string" minOccurs="0"/>
<xsd:element name="Postcode" type="xsd:string"/>
</xsd:sequence>
</xsd:complexType>

</xsd:schema>

The first thing to notice is that an XML schema is an XML document. We can tell this from the XML declaration and the familiar angled-bracket syntax.

The root element <xsd:schema> has various attributes concerned with namespaces. These are covered in section 5.4.3.3, XML Namespaces.

Next we have <xsd:element name="Employee">. This declares that the root element of any XML document that represents a valid employee should be <Employee>.

Note: Documents which validate against a given schema are referred to as instance documents.

The <Employee> element is defined as a complex type through the use of <xsd:complexType>. This means that <Employee> contains other elements and/or has attributes. That is, the content of <Employee> is more complex than simple text. If <Employee> contained just text, then it would be described as a simple type. Next, a sequence of elements is defined:

<xsd:sequence>
<xsd:element name="Name" type="xsd:string"/>
<xsd:element name="Address" type="AddressType"/>
<xsd:element name="PayBand" type="xsd:string"/>
<xsd:element name="Salary" type="xsd:decimal"/>
<xsd:element name="JobTitle" type="xsd:string"/>
<xsd:element name="JobDescription" type="xsd:string"/>
<xsd:element name="Email" type="xsd:string"/>
<xsd:element name="Department" type="xsd:string"/>
<xsd:element name="Phone" type="xsd:string"/>
</xsd:sequence>

This means that the <Employee> element of our instance document must contain these elements in exactly the order listed above. W3C XML Schema language provides three primary ways of combining elements: <xsd:sequence> (all of the elements, in the given order), <xsd:choice> (exactly one of the elements) and <xsd:all> (all of the elements, in any order).

These are known as compositors.
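Example 3.2 uses only xsd:sequence. The sketch below illustrates the other two compositors, xsd:choice and xsd:all. The type names ContactType and JobType are hypothetical and are not part of the employee schema.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Hypothetical type: exactly one of Email or Phone must appear -->
  <xsd:complexType name="ContactType">
    <xsd:choice>
      <xsd:element name="Email" type="xsd:string"/>
      <xsd:element name="Phone" type="xsd:string"/>
    </xsd:choice>
  </xsd:complexType>

  <!-- Hypothetical type: both elements must appear, in either order -->
  <xsd:complexType name="JobType">
    <xsd:all>
      <xsd:element name="JobTitle" type="xsd:string"/>
      <xsd:element name="JobDescription" type="xsd:string"/>
    </xsd:all>
  </xsd:complexType>

</xsd:schema>
```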

In Example 3.2, most of the element declarations refer to simple types, for example type="xsd:string" or type="xsd:decimal". These are built-in data types: XML Schema itself defines what a string or decimal value is. XML Schema provides many other built-in data types, such as dates, times and integers. For a full definition of the available types, refer to XML Schema Part 2: Datatypes www.w3.org/TR/xmlschema-2/ [External Website].
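The following sketch shows a few of the other built-in types in use. The EmploymentRecord element and its children are hypothetical names chosen for illustration; the comments give sample values in each type's lexical form.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- EmploymentRecord and its children are hypothetical -->
  <xsd:element name="EmploymentRecord">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="StartDate" type="xsd:date"/>     <!-- e.g. 2005-04-01 -->
        <xsd:element name="ShiftStart" type="xsd:time"/>    <!-- e.g. 09:00:00 -->
        <xsd:element name="LeaveDays" type="xsd:integer"/>  <!-- e.g. 25 -->
        <xsd:element name="PartTime" type="xsd:boolean"/>   <!-- true or false -->
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

</xsd:schema>
```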

The <Address> element refers to a user-defined data type, AddressType.

<xsd:complexType name="AddressType">
<xsd:sequence>
<xsd:element name="Line" type="xsd:string" maxOccurs="3"/>
<xsd:element name="Town" type="xsd:string"/>
<xsd:element name="County" type="xsd:string" minOccurs="0"/>
<xsd:element name="Postcode" type="xsd:string"/>
</xsd:sequence>
</xsd:complexType>

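AddressType is a complex type which consists of a sequence of address lines, town, county and postcode. It is interesting to note that <Line> has the attribute maxOccurs="3". This means that up to three address lines are permitted and, because minOccurs defaults to 1, at least one is required. Similarly, minOccurs="0" makes <County> optional. These occurrence constraints control how often an element may appear. They are distinct from facets, which are a way of further restricting the values allowed by the datatypes available to us in XML Schema language.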
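To make the distinction concrete, the sketch below applies length facets to the address line while keeping the maxOccurs constraint on the element. The type name AddressLineType and the 35-character limit are assumptions made for this illustration only.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Hypothetical simple type: facets restrict the VALUE of each line -->
  <xsd:simpleType name="AddressLineType">
    <xsd:restriction base="xsd:string">
      <xsd:minLength value="1"/>
      <xsd:maxLength value="35"/>
    </xsd:restriction>
  </xsd:simpleType>

  <xsd:complexType name="AddressType">
    <xsd:sequence>
      <!-- maxOccurs restricts HOW MANY lines may appear -->
      <xsd:element name="Line" type="AddressLineType" maxOccurs="3"/>
      <xsd:element name="Town" type="xsd:string"/>
      <xsd:element name="County" type="xsd:string" minOccurs="0"/>
      <xsd:element name="Postcode" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>

</xsd:schema>
```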

The fact that AddressType is defined outside the Employee element means that we could potentially use it in other schemas. Therein lies the power of XML Schema: its ability to define reusable datatypes. This idea is explored further in the next section.
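As a rough sketch of such reuse, a second schema could pull in the definitions from Employee.xsd with xsd:include and refer to AddressType by name. Office.xsd and its elements are hypothetical, and the sketch assumes Employee.xsd is a valid schema without a target namespace stored in the same directory.

```xml
<?xml version="1.0"?>
<!-- Office.xsd: a hypothetical second schema that reuses AddressType -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            elementFormDefault="qualified" attributeFormDefault="unqualified">

  <!-- Brings in every component defined in Employee.xsd, including AddressType -->
  <xsd:include schemaLocation="Employee.xsd"/>

  <xsd:element name="Office">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="OfficeName" type="xsd:string"/>
        <xsd:element name="Address" type="AddressType"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

</xsd:schema>
```

In practice, a shared type such as AddressType would more usually be kept in its own schema file so that other schemas can include it without also acquiring the Employee element.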

Finally, the id attribute is defined:

<xsd:attribute name="id" type="xsd:integer"/>

This means that the <Employee> element may have an id attribute and that, if present, its value must be an integer. Attributes declared in this way are optional by default.
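In XML Schema an attribute is optional unless the declaration says otherwise. If every employee record should carry an id, the declaration could be tightened with use="required", as in this one-line variation on the example:

```xml
<!-- Variation on the id declaration: validation fails if the attribute is missing -->
<xsd:attribute name="id" type="xsd:integer" use="required"/>
```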

There is a lot more to XML Schema than we have covered in this section. For a complete definition of the constructs available to data modellers using the XML Schema language, refer to XML Schema Part 1: Structures www.w3.org/TR/xmlschema-1/ [External Website]

5.4.3.2 e–Government Interoperability Framework (e–GIF)

In the last section, we developed an example schema which defined the elements, attributes and data types permitted in an employee document.

In effect, we created an employee vocabulary. A vocabulary is a set of element and attribute names associated with a particular markup language, such as RSS, XSLT or XHTML. W3C XML Schema language provides us with a rich language to create vocabularies for any subject area. These vocabularies may be further refined to include rules for element content and structure up to an arbitrary level of complexity.

In UK Government, the risk is that information modelling efforts are duplicated by different organisations. If schemas are developed independently, this can in turn lead to incompatibility. The impact is that data interchange becomes difficult, sharing information becomes impossible and interoperability breaks down.

To minimise this risk, Government has put in place an interoperability framework: the e-GIF. The following diagram illustrates the components of the framework:

Figure 3.1: Components of the e-Government Interoperability Framework

The e–GMS (e–Government Metadata Standard) and IPSV (Integrated Public Sector Vocabulary) are concerned with metadata and beyond the scope of this section. For more information on metadata, please refer to section 1.7, Getting users to use your site: metadata, search engines and promotion. Let us consider the other components in turn.

Government Data Standards Catalogue (GDSC)

The GDSC is a central repository of data standards organised around generic information types such as person information, address, identifiers and so on. Each entry in the catalogue defines a particular standard, such as person name, with details of who owns it, which international (or national) standard it is based on and what the format is. Additionally, each standard is linked to an XML Schema component which implements the information model set forth in the catalogue. A schema component or fragment is a user-defined data type that may be re-used in another XML application. (AddressType is an example of a schema component – see previous section.)

The idea behind the GDSC is that public sector schema developers incorporate these centrally approved schema definitions into their vocabularies. This saves development time and improves the potential for information sharing across government.

The GDSC is an online publication available at www.govtalk.gov.uk/gdsc/html/default.htm [External Website]

XML schemas

In addition to the generic data types covered by the GDSC, the GovTalk website (http://www.govtalk.gov.uk) also hosts a library of domain-specific schemas developed right across the public sector. These have been developed with specific projects in mind but are available for other schema developers to use in their projects. It is also a useful resource just to see other examples of public sector schema development. The GovTalk Schema Library may be accessed at www.govtalk.gov.uk/schemasstandards/schemalibrary.asp [External Website]

GovTalk also provides useful best practice guidance for schema developers, notably the e-Government Schema Guidelines for XML. These cover the elements of schema design, development of schema components and schema metadata. The schema guidelines and other resources may be accessed from the GovTalk Guidance for Developers page at www.govtalk.gov.uk/schemasstandards/developerguide.asp [External Website]

Technical Standards Catalogue

The Technical Standards Catalogue is a detailed list of standards in the broad policy areas laid out in the e-GIF. Of particular relevance to this chapter is the area of Data Integration. This is covered by Table 3 of the catalogue and includes XML, XSLT, XML Schema and other XML–related technologies. It is available at www.govtalk.gov.uk/egif/dataintegration.asp#table3 [External Website]

The e–GIF also covers specific business areas such as finance, health and e-learning. The Technical Standards Catalogue lists specifications in these areas, many of which are XML applications: GML (Geographical Markup Language), XBRL (Extensible Business Reporting Language) and UBL (Universal Business Language) to name but a few. The specifications for business areas are available at www.govtalk.gov.uk/egif/specifications.asp [External Website]. For further information, refer to the e–GIF which can be accessed from www.govtalk.gov.uk/schemasstandards/egif.asp [External Website]

5.4.3.3 XML Namespaces

The purpose of namespaces is twofold:

Note: An XML application is an XML–based markup language, not a software application that uses XML.

Let us concentrate on the second purpose because it is the more important of the two.

In Example 3.2 we created an employee vocabulary which included the element <Name>, defined as a simple type using the built-in data type xsd:string. If we wanted to share employee information with another department or, for example, a centralised HR function, it would be better to use the data standard for Person Name as set out in the Government Data Standards Catalogue (GDSC). As discussed in the previous section, this makes sense in the interests of information sharing and interoperability.

Person Name may be viewed at www.govtalk.gov.uk/gdsc/html/frames/PersonName-1-1-Release.htm [External Website]

The schema component which implements the above is PersonNameStructure – part of the PersonDescriptiveTypes schema – and is linked to from the above URL. PersonNameStructure is a complex type which is a sequence of title, given name, family name, suffix and requested name. The semantics are not important here but for the present discussion, the result is that our instance document (Example 1.1) changes from

<Name>Lindsey Brown</Name>

to

<Name>
<PersonNameTitle>Miss</PersonNameTitle>
<PersonGivenName>Lindsey</PersonGivenName>
<PersonFamilyName>Brown</PersonFamilyName>
</Name>

To enable this change, we need to make several changes to our schema. Consider the following updated <xsd:schema> element:

<xsd:schema targetNamespace="http://www.govtalk.gov.uk/employee"
xmlns:emp="http://www.govtalk.gov.uk/employee"
xmlns:pd="http://www.govtalk.gov.uk/people/PersonDescriptives"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
elementFormDefault="qualified" attributeFormDefault="unqualified">

First we add the targetNamespace attribute with a value of http://www.govtalk.gov.uk/employee – this value is called the namespace name. This has the effect of associating the elements and types declared in our schema, and therefore the elements of any conforming instance document, with the given URL.

Note: URLs are used as namespace names simply to associate XML vocabularies with a globally unique identifier. They do not necessarily point to a web page.

Next we define a namespace prefix emp:

xmlns:emp="http://www.govtalk.gov.uk/employee"

This is a shorthand so that we can type emp rather than http://www.govtalk.gov.uk/employee when we want to put elements in our newly defined namespace. This is the same as namespace declarations that we have already encountered: xsl for XSLT stylesheets, fo for Formatting Objects files and xsd for XML Schema documents.
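One practical consequence is that two vocabularies can each define an element called Name without ambiguity, because the prefix shows which namespace each belongs to. The fragment below is a well-formed illustration only: the Screening wrapper element and the film namespace name are assumptions, and the fragment is not tied to any schema.

```xml
<?xml version="1.0"?>
<!-- Screening and the film namespace name are hypothetical -->
<Screening xmlns:emp="http://www.govtalk.gov.uk/employee"
           xmlns:film="http://www.govtalk.gov.uk/film">
  <!-- Same local name, different namespaces, so no clash -->
  <film:Name>Example Film</film:Name>
  <emp:Name>Lindsey Brown</emp:Name>
</Screening>
```

The prefix itself carries no meaning. Any prefix bound to the same namespace name identifies the same vocabulary.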

Following this, we define a further namespace prefix pd:

xmlns:pd="http://www.govtalk.gov.uk/people/PersonDescriptives"

This is the target namespace of the PersonDescriptiveTypes schema.

The last two settings,

elementFormDefault="qualified" attributeFormDefault="unqualified">

mean that in our instance document, elements must be namespace-qualified (either by a prefix or by a default namespace declaration) but attributes need not be. This is set out in the e-Government Schema Guidelines for XML as a convention to ensure that developers reading or reusing a schema do not have to trace the internal structure of the schema.
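The effect on an instance document can be sketched as follows, using the emp prefix. The PayBand value is a placeholder and the remaining child elements are omitted for brevity, so this fragment is not a complete, valid employee record.

```xml
<!-- Elements carry the emp prefix; the id attribute is unprefixed -->
<emp:Employee xmlns:emp="http://www.govtalk.gov.uk/employee" id="91710">
  <!-- Locally declared elements are qualified too, because elementFormDefault="qualified" -->
  <emp:PayBand>B</emp:PayBand>
  <!-- remaining elements omitted -->
</emp:Employee>
```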

Mixing Vocabularies

To bring the definition of PersonNameStructure into our schema, we need to import the PersonDescriptiveTypes schema of which it is a component. We do this by adding an <xsd:import> element as the first child of <xsd:schema>, directly after its start tag:

<xsd:import
namespace="http://www.govtalk.gov.uk/people/PersonDescriptives"
schemaLocation="PersonDescriptiveTypes-v1-1.xsd"/>

Note: This assumes that we have downloaded the schema and placed it in the same directory as the employee schema.

By importing the PersonDescriptiveTypes schema, all of the components defined in that schema are available to us as part of the employee vocabulary. Finally, we change

<xsd:element name="Name" type="xsd:string"/>

to

<xsd:element name="Name" type="pd:PersonNameStructure"/>

This enables us to use the full e–GIF compliant data standard for Person Name to express the name of an employee. The final result is as follows:

Example 3.3: Employee.xsd which makes use of the e-GIF standard for Person Name

<xsd:schema targetNamespace="http://www.govtalk.gov.uk/employee"
xmlns:pd="http://www.govtalk.gov.uk/people/PersonDescriptives"
xmlns:emp="http://www.govtalk.gov.uk/employee"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
elementFormDefault="qualified" attributeFormDefault="unqualified">

<xsd:import
namespace="http://www.govtalk.gov.uk/people/PersonDescriptives"
schemaLocation="PersonDescriptiveTypes-v1-1.xsd"/>

<xsd:element name="Employee">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="Name" type="pd:PersonNameStructure"/>
<xsd:element name="Address" type="emp:AddressType"/>
<xsd:element name="PayBand" type="xsd:string"/>
<xsd:element name="Salary" type="xsd:decimal"/>
<xsd:element name="JobTitle" type="xsd:string"/>
<xsd:element name="JobDescription" type="xsd:string"/>
<xsd:element name="Email" type="xsd:string"/>
<xsd:element name="Department" type="xsd:string"/>
<xsd:element name="Phone" type="xsd:string"/>
</xsd:sequence>
<xsd:attribute name="id" type="xsd:integer"/>
</xsd:complexType>
</xsd:element>

<xsd:complexType name="AddressType">
<xsd:sequence>
<xsd:element name="Line" type="xsd:string" maxOccurs="3"/>
<xsd:element name="Town" type="xsd:string"/>
<xsd:element name="County" type="xsd:string" minOccurs="0"/>
<xsd:element name="Postcode" type="xsd:string"/>
</xsd:sequence>
</xsd:complexType>

</xsd:schema>

Address and Personal Details

It is worth noting here that the PersonDescriptiveTypes schema also forms part of a set of schemas available on GovTalk called Address and Personal Details. The preferred e-GIF address standard is the BS7666 Address, the schema for which is also a part of this set. To develop Example 3.3 further we could import the BS7666 namespace and use it to redefine <Address> in the same way that we have for <Name> above. This is the recommended approach but for simplicity, this is left to the reader.

The Address and Personal Details schemas can be accessed via the GovTalk website at www.govtalk.gov.uk/schemasstandards/schemalibrary.asp [External website]

Associating XML Documents with Schemas

We have seen how to create a schema and how to declare a namespace, but how do we reference the schema from our employee instance document? The answer is through xsi, the conventional prefix for the namespace http://www.w3.org/2001/XMLSchema-instance, which is designed specifically for this purpose. This namespace defines a schemaLocation attribute that is typically used in the root element of an instance document as follows:

<Employee xmlns="http://www.govtalk.gov.uk/employee"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.govtalk.gov.uk/employee Employee.xsd"
id="91710">

We have added three new attributes:

  1. The xmlns attribute declares the namespace of this document.
  2. The xmlns:xsi defines the namespace prefix for referencing schemas.
  3. The xsi:schemaLocation has a pair of values (separated by a space): the first is the namespace name of the associated schema and the second is the location of the schema.

Note: This assumes that Employee.xsd is in the same folder as the instance document.

Now that we have tied our instance document directly to a schema, a validating parser can locate the schema and check the document against it.
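Putting the pieces together, a complete instance document might look like the sketch below. Apart from the name and id, all values are illustrative placeholders. The sketch also assumes that the PersonDescriptiveTypes schema qualifies its local elements, so the children of <Name> are written with the pd prefix; this should be checked against the downloaded schema.

```xml
<?xml version="1.0"?>
<Employee xmlns="http://www.govtalk.gov.uk/employee"
          xmlns:pd="http://www.govtalk.gov.uk/people/PersonDescriptives"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://www.govtalk.gov.uk/employee Employee.xsd"
          id="91710">
  <Name>
    <!-- Assumption: these elements live in the PersonDescriptives namespace -->
    <pd:PersonNameTitle>Miss</pd:PersonNameTitle>
    <pd:PersonGivenName>Lindsey</pd:PersonGivenName>
    <pd:PersonFamilyName>Brown</pd:PersonFamilyName>
  </Name>
  <Address>
    <Line>1 Example Street</Line>
    <Town>Exampletown</Town>
    <Postcode>AB1 2CD</Postcode>
  </Address>
  <PayBand>B</PayBand>
  <Salary>25000.00</Salary>
  <JobTitle>Analyst</JobTitle>
  <JobDescription>An example job description.</JobDescription>
  <Email>lindsey.brown@example.org</Email>
  <Department>Example Department</Department>
  <Phone>01234 567890</Phone>
</Employee>
```

Because the employee namespace is declared as the default namespace, the employee elements need no prefix, yet they are still namespace-qualified.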

XML Namespaces is a W3C Recommendation. For more information, refer to Namespaces in XML www.w3.org/TR/REC-xml-names/ [External website]

For your assistance – resources

XML Schema Part 0: Primer
www.w3.org/TR/xmlschema-0/ [External website]

XML Schema Part 1: Structures
www.w3.org/TR/xmlschema-1/ [External website]

XML Schema Part 2: Datatypes
www.w3.org/TR/xmlschema-2/ [External website]

Government Data Standards Catalogue
www.govtalk.gov.uk/gdsc/html/ [External website]

e-Government Interoperability Framework
www.govtalk.gov.uk/schemasstandards/egif.asp? [External website]

Technical Standards Catalogue
www.govtalk.gov.uk/egif/ [External website]

GovTalk Schema Library
www.govtalk.gov.uk/schemasstandards/schemalibrary.asp [External website]

GovTalk Guidance for Developers
www.govtalk.gov.uk/schemasstandards/developerguide.asp [External website]

e-Government Schema Guidelines for XML 3.1
www.govtalk.gov.uk/schemasstandards/developerguide_document.asp?docnum=946 [External website]

W3C Namespaces in XML Recommendation
www.w3.org/TR/REC-xml-names/ [External website]
