Survey cataloguing

Rationale

Sharing survey microdata with legitimate users offers many benefits, including more diverse research work, greater acceptability of the data, and improved data quality (see Data dissemination - Benefits, risks and costs). To maximize these benefits, interested users need to be informed about the existence and characteristics of the datasets.

Many potential users will have very little, if any, information about the available datasets. In most cases, their search for data will not be in the form of "Where can I get the Bhutan Living Standards Survey 2003 data?" More often, the search will be in the form of "Where can I find data on rural household consumption of rice in India for the period 2000-2005, and who should I ask for permission to use these data?" Answering such questions requires that good metadata be publicly available, preferably in the form of a searchable catalog.

Characteristics of a good survey catalog

From the user's point of view:

  • Compliant with an international metadata standard

    International XML metadata standards such as the Data Documentation Initiative (DDI) considerably facilitate the production and maintenance of such catalogs. XML metadata are "structured", i.e. they are by nature searchable.

  • Based on recent and standard technology (XML)
  • Provides rich metadata, including at the variable level

    Survey catalogs become particularly relevant and powerful when the survey metadata provide not only a detailed description of the survey itself (with information on title, primary investigator, sampling, date of data collection, topics, geographic coverage, etc.), but also a detailed description of each variable (with information on variable name and label, categories, literal question, interviewer's instructions, definitions).

    A variable-level catalog can be established relatively easily using the DDI metadata standard and the IHSN tools (in particular the IHSN Microdata Management Toolkit and the free National Data Archive (NADA) application).

  • Provides user-friendly search functionalities (full text search)
  • Provides clear information on the policy and procedure for accessing the data
  • Provides a list of, and direct access to, reference materials (questionnaires, manuals, reports)
  • Includes a "search by topic" compliant with an international thesaurus

    To facilitate the exchange of information among catalogs, the data archive community has developed thesauri to describe the topics covered by the datasets listed in their respective catalogs. A thesaurus is a set of terms or concepts used to describe objects such as datasets, variables, books, etc. The terms in a thesaurus are normally organized as a tree or hierarchy, with broader terms being parents to narrower terms. Usually, a thesaurus will also include parallel terms and synonyms, allowing users to find what they are looking for even when they are not using the preferred terms.

    Many archives use a thesaurus when adding keywords at the study level or concepts at the variable level. The use of a thesaurus will encourage consistency by making sure that the same terms are selected when describing identical objects. Moreover, if users have access to the thesaurus when searching for data, there is a greater chance that they will use terms and concepts returning the most relevant list of hits.

    An example of the use of a thesaurus is the catalog maintained by CESSDA, an umbrella organization for social science data archives across Europe.

    One of the most elaborate thesauri is HASSET (the Humanities and Social Science Electronic Thesaurus), developed by the UK Data Archive. HASSET is an integrated part of the UK Data Archive's data catalogue.

    When a user enters and submits a term such as "income", the catalogue displays information about the relationships between this term and other terms (broader, narrower and related terms).

    From this point on, users can either look up one of the broader, narrower or related terms, or simply decide to include all or some of them in the query.

    A slightly reduced version of HASSET has recently been translated into several languages for use in an integrated data catalogue for all European social science data archives. This will allow users to search for data documented in a variety of languages using their preferred language. The thesaurus will automatically translate the selected terms into all relevant languages and perform the search accordingly.

  • Linked to other catalogs (network/portal)
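
As an illustration of the thesaurus-based "search by topic" described above, the following minimal Python sketch shows how synonyms and narrower terms can be used to expand a query before matching it against study-level keywords. Everything in it is hypothetical: the terms, the dataset titles, and the function names (preferred, expand, search) are invented for illustration and are not taken from HASSET or from any real catalog software.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Term:
    """One preferred term in a toy thesaurus (all entries are illustrative)."""
    name: str
    broader: list[str] = field(default_factory=list)
    narrower: list[str] = field(default_factory=list)
    synonyms: list[str] = field(default_factory=list)


# Hypothetical mini-thesaurus; NOT an extract of HASSET or any real thesaurus.
THESAURUS = {
    "income": Term("income", broader=["economic conditions"],
                   narrower=["wages", "pensions"], synonyms=["earnings"]),
    "wages": Term("wages", broader=["income"], synonyms=["salaries"]),
    "pensions": Term("pensions", broader=["income"]),
    "economic conditions": Term("economic conditions", narrower=["income"]),
}

# Map every synonym to its preferred term.
SYNONYM_INDEX = {syn: t.name for t in THESAURUS.values() for syn in t.synonyms}


def preferred(term: str) -> str:
    """Normalize a user-supplied term and replace a synonym by its preferred term."""
    term = term.strip().lower()
    return SYNONYM_INDEX.get(term, term)


def expand(term: str) -> set[str]:
    """Return the preferred term plus all of its narrower terms, recursively."""
    root = preferred(term)
    result = {root}
    stack = list(THESAURUS[root].narrower) if root in THESAURUS else []
    while stack:
        current = stack.pop()
        if current not in result:
            result.add(current)
            if current in THESAURUS:
                stack.extend(THESAURUS[current].narrower)
    return result


# Hypothetical catalog: dataset title -> study-level keywords.
CATALOG = {
    "Survey A": {"wages", "employment"},
    "Survey B": {"pensions"},
    "Survey C": {"education"},
}


def search(term: str) -> list[str]:
    """List datasets whose keywords match the expanded query."""
    wanted = expand(term)
    return sorted(title for title, keywords in CATALOG.items() if keywords & wanted)
```

In this sketch, a search for "earnings" is first mapped to the preferred term "income" and then widened to "wages" and "pensions", so datasets indexed only under narrower terms are still found. A real catalog would typically let the user choose which broader, narrower or related terms to include rather than always expanding downwards.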

From the catalog administrator's point of view:

  • Provides a secure environment for storing and sharing data and metadata
  • Provides a "users' requests" and "users' management" tool to receive and respond to data requests and information queries
  • Provides a solution for sharing public use files and licensed files
  • Generates admin reports on access requests received/processed; most popular surveys/documents; keywords used for searching data; etc.

Developing a survey catalog

The IHSN provides various tools (the IHSN Microdata Management Toolkit and the National Data Archive (NADA) application) and guidelines that allow data-producing agencies to easily develop DDI-compliant survey/census catalogs.

The Accelerated Data Program (ADP) provides technical and financial support to selected countries in the development of such survey databanks.
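
To illustrate why structured, variable-level metadata make a catalog searchable, the short Python sketch below parses a small XML fragment and looks for variables whose label or literal question mentions a keyword. The fragment is a simplified, namespace-free example that only loosely follows DDI Codebook element names (var, labl, qstn, qstnLit); the survey title, the variables and the function name find_variables are invented for illustration, and this is not how the IHSN tools are implemented.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free fragment loosely modelled on DDI Codebook
# element names. The survey and its variables are invented (assumption).
SAMPLE = """
<codeBook>
  <stdyDscr>
    <citation><titlStmt><titl>Example Household Survey 2004</titl></titlStmt></citation>
  </stdyDscr>
  <dataDscr>
    <var name="hhsize">
      <labl>Household size</labl>
      <qstn><qstnLit>How many people usually live in this household?</qstnLit></qstn>
    </var>
    <var name="rice_kg">
      <labl>Rice consumed in the last 7 days (kg)</labl>
      <qstn><qstnLit>How much rice did your household consume in the last 7 days?</qstnLit></qstn>
    </var>
  </dataDscr>
</codeBook>
"""


def find_variables(xml_text, keyword):
    """Return (survey title, variable name, label) for variables matching keyword."""
    root = ET.fromstring(xml_text)
    title = root.findtext("stdyDscr/citation/titlStmt/titl", default="")
    keyword = keyword.lower()
    hits = []
    for var in root.iter("var"):
        label = var.findtext("labl", default="")
        question = var.findtext("qstn/qstnLit", default="")
        # Case-insensitive match on the variable label or the literal question.
        if keyword in label.lower() or keyword in question.lower():
            hits.append((title, var.get("name"), label))
    return hits


if __name__ == "__main__":
    for hit in find_variables(SAMPLE, "rice"):
        print(hit)
```

Because the label and the literal question are stored in separate, named elements, a catalog can index them directly and answer topic-oriented queries (for example, data on rice consumption) without the user knowing the name of the survey in advance.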