Introduction

Purpose and audience. This document sets forth guidelines for the design of data warehouse tables, for mainframe data extracts, and, consequently, for their documentation. It is based on the Central Information Warehouse (CIW) team's experience and the input of users, developers and data administrators at CU and at other institutions.

The intended audience for this document includes everybody who is interested in the details of the data warehouse: designers, programmers, data and database administrators, and technically-oriented data users. The document's length and vocabulary are due to its attempt to balance technical depth for programmers with relevance for data users. A glossary and a bibliography are provided at the end of the document.

Why have guidelines? There are four major benefits of articulating and then following formal guidelines for the design of a data warehouse and the steps that populate the data warehouse. These data design guidelines are intended to help:

Improve the outcomes of each design step. The outcome of each step is described, so that everybody who participates in the process has a more accurate and complete understanding of each of the steps that come before or afterward.

Make inputs more substitutable. A clear description of the design steps and their implied assumptions make it easier to recognize and deal with variations in the inputs to a step (e.g., variations in the complexity of subject areas or differences in data sources).

Make outputs more reliable. Well-understood design steps that process well-understood inputs are more likely to produce well-understood outputs (whether they are Oracle tables or mainframe data extracts). Standards that were applied to the product's design and development become part of the product's documentation. Information products will be used more reliably if they are thoroughly documented.

Make maintenance more economical. Maintenance tends to be on-going and it implies some re-design. Standardization, documentation and other investments in good practice during the original design process pay off during maintenance because the programmer who embarks on a maintenance task has a much better map of the domain as a whole.

By defining terms and amplifying important concepts, these data design guidelines seek to bridge the communications gap between the user's and the developer's perspectives. Both the user and the developer benefit as a result.

Guidelines not a methodology. The preceding discussion might suggest that these guidelines are the equivalent of a design methodology. There are similarities, but these guidelines focus mainly on the outcomes of the major steps and do not prescribe methods within a step. When methods are described or assumptions about the processes are made, they are the result of the need for clarity and organization, not the result of conviction that the "right methodology" is at hand! Many of these assumptions are stated in the next three sections of the paper.

In order to focus on the design of Oracle tables and the preparation of data to populate those tables, many related topics are not discussed in these guidelines. Where specific references to such discussions exist, references are provided in the bibliography at the end. For example, naming guidelines for the Oracle tables, titled "Warehouse Naming Guidelines," have been developed by a multi-campus group and are maintained and published by IRM via a gopher server (in IRM's "Guidelines and Standards" directory on infonet.colorado.edu). Guidelines are yet to be written on several subjects; see the bibliography at the end of this document. Documentation of mainframe applications is not cited but is a very important source of information about the "live" data in an application system.