Metadata standards
Eager to facilitate data communication between organizations and software systems, and to improve the quality of statistical documentation provided to users of data, the data archive community has developed a set of metadata standards. These metadata standards provide a structured framework for organizing and disseminating information on the content and structure of statistical information. To take full advantage of web technology, most standards are defined in the XML language. The Data Documentation Initiative (DDI) specification is a standard dedicated to the documentation of microdata. The Dublin Core provides a standard for the documentation of related resources.
The XML Language
XML stands for eXtensible Markup Language. It was developed as a common tool to structure information to be shared on the Web and between software systems. XML is a way of tagging text for meaning rather than appearance. In other words, XML can be used to organize the content of a text by tagging it with meaningful labels. Although the "tags" are conceptually the same as the "fields" in a database in terms of organization, the major difference between XML files and database files is that the former are regular text files which can be viewed and edited using any standard text editor. An XML file can be searched and queried like a regular database using tools such as XPath or XQuery, and edited using XForms. (A web-based tutorial on these tools can be found at http://www.w3schools.com/xml.) Just as the content of a database can be converted into a report, XML documents can be read and transformed by other software applications into user-friendly formats such as spreadsheets, PDF files or Web pages.
The example below shows how textual information about a survey could be presented in XML.
The same information converted into XML using DDI tags would look like this:
<titl>Multiple Indicator Cluster Survey 2005</titl>
<altTitl>MICS</altTitl>
<AuthEnty>National Statistics Office (NSO)</AuthEnty>
<fundAg abbr="UNICEF">United Nations Children's Fund</fundAg>
<collDate date="2005-01" event="start"/>
<collDate date="2005-03" event="end"/>
<nation>Popstan</nation>
<geogCover>National</geogCover>
<sampProc>5,000 households, stratified two stages</sampProc>
<respRate>98 percent</respRate>
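Because the tagged text is an ordinary text file, it can be queried with standard tools. The following is a minimal sketch using Python's standard-library `xml.etree.ElementTree` module, which supports a limited subset of XPath. The `<study>` wrapper element is an assumption added for this example only: the fragment above has no single root element, and a real DDI file nests these tags more deeply.

```python
import xml.etree.ElementTree as ET

# The DDI-tagged fragment from the text, wrapped in a hypothetical <study>
# root element so that it forms a well-formed XML document.
FRAGMENT = """<study>
  <titl>Multiple Indicator Cluster Survey 2005</titl>
  <altTitl>MICS</altTitl>
  <AuthEnty>National Statistics Office (NSO)</AuthEnty>
  <fundAg abbr="UNICEF">United Nations Children's Fund</fundAg>
  <collDate date="2005-01" event="start"/>
  <collDate date="2005-03" event="end"/>
  <nation>Popstan</nation>
  <geogCover>National</geogCover>
  <sampProc>5,000 households, stratified two stages</sampProc>
  <respRate>98 percent</respRate>
</study>"""

root = ET.fromstring(FRAGMENT)

# Read the text content of an element by tag name.
title = root.findtext("titl")

# Read an attribute of an element.
funder_abbr = root.find("fundAg").get("abbr")

# Select one of several same-named elements with an XPath attribute predicate.
start_date = root.find("collDate[@event='start']").get("date")
end_date = root.find("collDate[@event='end']").get("date")

print(title)
print(funder_abbr)
print(start_date, "to", end_date)
```

Each value is retrieved by the meaning of its tag rather than by its position in the text, which is what makes the same query reusable across many documents.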
The use of tags is particularly powerful when a community of users agree on a common set of tags (such as the DDI or Dublin Core standards). Adoption of a common set of XML tags offers major advantages in documenting microdata including:
- Creation of a comprehensive "checklist" of useful metadata elements;
- Potential to assess the content of a file by determining whether particular tags are, or are not, within that file;
- Creation of a dataset catalog which can be queried for key metadata elements;
- Potential to transform the file into more user-friendly formats. XML files can be converted into HTML, PDF or other types of documents using XSL Transformations. They can also be exchanged across networks or the Internet using web services or SOAP. An example of the application of "XSL Transformation" to the XML file above is the following HTML web page:
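Two of the points in this list, checking which tags are present and transforming the file into a friendlier format, can be illustrated with a short Python sketch. Note that this is not XSLT: it uses only the standard library as a stand-in for a real XSL transformation. The `LABELS` checklist, the `<study>` wrapper element, and the function names are assumptions made for the example.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical checklist of tags and their display labels (example only).
LABELS = {
    "titl": "Title",
    "AuthEnty": "Producer",
    "nation": "Country",
    "respRate": "Response rate",
    "abstract": "Abstract",
}

# A shortened version of the survey fragment, wrapped in a hypothetical root.
SAMPLE = """<study>
  <titl>Multiple Indicator Cluster Survey 2005</titl>
  <AuthEnty>National Statistics Office (NSO)</AuthEnty>
  <nation>Popstan</nation>
  <respRate>98 percent</respRate>
</study>"""


def missing_tags(root, checklist):
    """Return the checklist tags that do not occur anywhere in the document."""
    return [tag for tag in checklist if root.find(f".//{tag}") is None]


def to_html(root, labels):
    """Render the labelled elements that are present as a two-column HTML table."""
    rows = []
    for tag, label in labels.items():
        element = root.find(f".//{tag}")
        if element is not None:
            value = html.escape(element.text or "")
            rows.append(f"<tr><th>{html.escape(label)}</th><td>{value}</td></tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"


root = ET.fromstring(SAMPLE)
print(missing_tags(root, LABELS))
print(to_html(root, LABELS))
```

In this sample the only checklist tag reported as missing is `abstract`, and the HTML table contains one row for each of the four tags that are present.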

The Data Documentation Initiative (DDI)
The Data Documentation Initiative (DDI) is an effort to establish an international XML-based standard for microdata documentation. Its aim is to provide a straightforward means to record and communicate to others all the salient characteristics of micro-datasets. The DDI specification is a major transformation of the once-familiar electronic "codebook," which retains the same set of capabilities but greatly increases the scope and rigor of the information contained in it. The DDI metadata specification originated in the Inter-university Consortium for Political and Social Research (ICPSR), a membership-based organization with over 500 member colleges and universities around the world. It is now the project of an alliance of institutions in North America and Europe. The member institutions comprise many of the largest data producers and data archives in the world. An important goal of the initiative is to become an ISO standard. The most recent version of the DDI specification is version 3.0. Version 2.1, however, is the most widespread.
By creating a consistent framework for microdata documentation, the DDI has the following features:
- Interoperability
  DDI-compliant documentation can be exchanged and transported seamlessly, and applications can be generically written, because the documents are homogeneous.
- Richer content
  The DDI provides data analysts with broader knowledge about data content, because it provides a comprehensive set of elements that can describe micro-datasets as completely and as thoroughly as possible.
- Multi-purpose documentation
  A DDI codebook can be restructured to suit different applications, because it contains all the information necessary to produce different types of output.
- On-line analytical capability
  DDI documents can be easily imported into on-line analysis systems, rendering datasets more readily usable by a wider audience. This is made possible because the DDI markup extends down to the variable level and provides a standard uniform structure and content for variables.
- Search capability
  Field-specific searches across documents and studies are made possible, because each of the elements in a DDI-compliant codebook is tagged in a specific way.
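The search capability described above can be sketched as follows. This is a minimal illustration, not a catalog implementation: the three documents, their file names, the `<study>` wrapper element, and the `search` function are all made up for the example, and the tag names `titl` and `nation` are taken from the survey fragment shown earlier.

```python
import xml.etree.ElementTree as ET

# Three tiny, invented documents standing in for a catalog of DDI files.
CATALOG = {
    "mics2005.xml": "<study><titl>Multiple Indicator Cluster Survey 2005</titl>"
                    "<nation>Popstan</nation></study>",
    "lfs2006.xml": "<study><titl>Labour Force Survey 2006</titl>"
                   "<nation>Popstan</nation></study>",
    "dhs2004.xml": "<study><titl>Demographic and Health Survey 2004</titl>"
                   "<nation>Otherland</nation></study>",
}


def search(catalog, tag, keyword):
    """Return the names of documents in which element `tag` contains `keyword`."""
    keyword = keyword.lower()
    hits = []
    for name, xml_text in catalog.items():
        root = ET.fromstring(xml_text)
        for element in root.iter(tag):
            if keyword in (element.text or "").lower():
                hits.append(name)
                break  # one match per document is enough
    return hits


# The same keyword gives different results depending on the field searched.
print(search(CATALOG, "nation", "popstan"))
print(search(CATALOG, "titl", "popstan"))
```

Because the search is restricted to one tag, a country name that happens to appear in a title or an abstract does not produce a false match on the country field.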
Coverage
The DDI specification has been designed to fully encompass the kinds of data generated by surveys, censuses, administrative records, experiments, direct observation, and other systematic methodologies for generating empirical measurements. In other words, the unit of analysis could be individual persons, households, families, business establishments, transactions, countries, or other subjects of scientific interest. Similarly, observations may consist of measures taken at a single point in time in a single setting, such as a sample of people in one country during one week, or they may consist of repeated observations in multiple settings, including longitudinal and repeated cross-sectional data from many countries, as well as time series of aggregate data. The DDI specification also provides for full descriptions of the methodology of the study (mode of data collection, sampling methods if applicable, universe, geographical areas of study, responsible organization and persons, and so on).
Structure
The DDI specification permits all aspects of a survey to be described in detail: the methodology, responsibilities, files and variables. It provides a structured and comprehensive list of hundreds of elements and attributes that may be used to document a dataset, although it is unlikely that any one study would use all of them. However, some elements, such as "Title," are mandatory (and must be unique). Other elements are optional and can be repeated, for example "Authoring Entity/Primary Investigator", since it includes information on the person(s) and/or organization(s) responsible for the survey.
The DDI elements are organized in five sections:
- Section 1.0: Document Description
  A study (survey, census or other) is not always documented and disseminated by the same agency as the one that produced the data. It is therefore important to provide information (metadata) not only on the study itself, but also on the documentation process. The Document Description consists of overview information describing the DDI-compliant XML document, or, in other words, "metadata about the metadata".
- Section 2.0: Study Description
  The Study Description consists of overview information about the study. This section includes information about how the study should be cited, who collected, compiled and distributes the data, a summary (abstract) of the content of the data, information on data collection methods and processing, and so on.
- Section 3.0: Data File Description
  This section is used to describe each data file in terms of content, record and variable counts, version, producer, and so on.
- Section 4.0: Variable Description
  This section presents detailed information on each variable, including literal question text, universe, variable and value labels, derivation and imputation methods, and so on.
- Section 5.0: Other Material
  This section allows for the description of other materials related to the study. These can include resources such as documents (questionnaires, coding information, technical and analytical reports, interviewer's manuals, and so on), data processing and analysis programs, photos, and maps. However, the Dublin Core Metadata Initiative (described below) is better suited for the Toolkit requirements.
Some useful references
"Data Documentation Initiative: Toward a Standard for the Social Sciences", by Mary Vardigan (ICPSR), Pascal Heus (ODaF), Wendy Thomas (MPC), presented at the 3rd International Digital Curation Conference, Washington, DC, Nov 2007 (Presentation).
"Data Documentation Initiative: Toward a Standard for the Social Sciences", by Mary Vardigan (ICPSR), Pascal Heus (ODaF), Wendy Thomas (MPC), International Journal of Digital Curation, Vol 3, No 1, Aug 2008 (PDF)
The Dublin Core Metadata Standard (DCMI)
The DCMI Metadata Element Set (ISO standard 15836), also known as the Dublin Core metadata standard, is a simple set of elements for describing digital resources. This standard is particularly useful to describe resources related to microdata such as questionnaires, reports, manuals, data processing scripts and programs, etc. It was initiated in 1995 by the Online Computer Library Center (OCLC) and the National Center for Supercomputing Applications (NCSA) at a workshop in Dublin, Ohio. Over the years it has become the most widely used standard for describing digital resources on the Web and was approved as an ISO standard in 2003. The standard is maintained and further developed by the Dublin Core Metadata Initiative - an international organization dedicated to the promotion of interoperable metadata standards.
A major reason behind the success of the Dublin Core metadata standard is its simplicity. From the outset it has been the goal of the designers to keep the element set as small and simple as possible to allow the standard to be used by non-specialists. The purpose of the standard is to make it easy and inexpensive to create simple descriptive records for information resources, while providing for effective retrieval of those resources on the Web or in any similar networked environment. In its simplest form the Dublin Core consists of 15 metadata elements, all of which are optional and repeatable. The 15 elements are:
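Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, and Rights.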
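As an illustration of how light a Dublin Core description can be, the sketch below builds a record for a survey questionnaire with Python's standard library. The namespace URI is that of the Dublin Core element set, version 1.1. The `<record>` wrapper element, the `dublin_core_record` function, and the sample values are assumptions made for this example, since Dublin Core itself does not prescribe a container element.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15-element Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def dublin_core_record(fields):
    """Build a record from (element, value) pairs.

    Every element is optional and may be repeated, so the input is a list
    of pairs rather than a dictionary.
    """
    record = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields:
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    return record


# Invented description of a questionnaire available in two languages.
questionnaire = dublin_core_record([
    ("title", "MICS 2005 Household Questionnaire"),
    ("creator", "National Statistics Office (NSO)"),
    ("language", "en"),
    ("language", "fr"),
    ("type", "Text"),
    ("format", "application/pdf"),
])
print(ET.tostring(questionnaire, encoding="unicode"))
```

The record uses only five of the fifteen elements and repeats one of them, which is valid precisely because all elements are optional and repeatable.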
ISO 11179 - Information Technology - Metadata registries (MDR)
The International Standard ISO/IEC 11179-1 was developed by the Joint Technical Committee ISO/IEC JTC 1, Information technology, Subcommittee SC 32, Data management services. "ISO/IEC 11179 describes the standardizing and registering of data elements to make data understandable and shareable. Data element standardization and registration as described in ISO/IEC 11179 allow the creation of a shared data environment in much less time and with much less effort than it takes for conventional data management methodologies." (Source: ISO-IEC 1999, available at http://metadata-stds.org/11179-1/ISO-IEC_11179-1_1999_IS_E.pdf)
Statistical Data and Metadata Exchange (SDMX)
Focusing on time series and indicators, SDMX is the result of a joint effort by the Bank for International Settlements (BIS), the European Central Bank (ECB), EUROSTAT, the International Monetary Fund (IMF), the Organization for Economic Cooperation and Development (OECD), the United Nations (UN), and the World Bank (WB) to create an XML specification to support the exchange of aggregate data and metadata. SDMX provides three types of statistical metadata standards: standards for data formats, standards for metadata, and a registry-based architecture to implement these standards and to exchange data between systems.
One of the requirements of SDMX was the awareness of other metadata specifications such as the Data Documentation Initiative (DDI). Any of the DDI metadata - which emphasizes archival metadata and micro-data, rather than aggregate data - is exchangeable in an equivalent SDMX metadata format. This ensures inter-operability of metadata across namespaces.
For more information on the relationship between SDMX and the DDI, see:
- "DDI and SDMX", by Arofan Gregory (ODaF) and Pascal Heus (ODaF), IDSC Workshop on comparability of DDI/SDMX, Wiesbaden, Germany, June 18th 2008 (Powerpoint presentation)
- "DDI and SDMX: Complementary, Not Competing, Standards", by A. Gregory and P. Heus, Open Data Foundation, July 2007 (Paper)
ISO 19115
Maintained by the ISO/TC211 technical committee for Geographic Information/Geomatics, ISO 19115 defines the schema required for describing geographic information and services. It provides information about the identification, the extent, the quality, the spatial and temporal schema, spatial reference, and distribution of digital geographic data.
ISO 19115 is one of the core standards of the United Nations Geographical Information Working Group (UNGIWG), a network of professionals working in the fields of cartography and geographic information science to build the UN Spatial Data Infrastructure needed to achieve sustainable development.
ISO 19115 is also a recommendation of the Open Geospatial Consortium.
