What does a row occurrence represent? To understand the data in the data warehouse and to communicate about it requires an understanding of what the occurrence of a row in a warehouse table represents in the real world. For many tables having operational significance, it is easy to map between data on the mainframe and corresponding rows in a data warehouse table. This idea is illustrated in Exhibit 2, which also suggests that many rows in warehouse tables correspond to record segments on the mainframe. There are, however, limits to this correspondence. In many cases a mainframe file will provide for a fixed number of segments, some of which are blank and therefore should not appear in the warehouse. In other cases a segment is marked as inactive and is "blank" or "missing" from the application's perspective, even though there are data values in the segment; such segments usually should not appear in the data warehouse.
Degree of normalization. In a procedural context, where a program is directing a cursor to loop through the segments of a record, the repeating segments of mainframe files make a lot of sense. In a non-procedural or relational context, where SQL is the main data manipulation and extraction language, normalization is preferred. In the case of "live" databases that are being updated by multiple users and multiple programs, normalization is essential. Other considerations may apply in the case of a data warehouse: when the users of the data must perform a join to make sense of the data (e.g., a job class code is useless without a job class title), performance considerations may require that some tables be denormalized in advance.
Tables that describe history. Representing history can quickly become more complex than most potential data users can tolerate. This problem is especially severe if the application that produces the data has not operationalized history concepts in such a way that they are easily remembered and used. In addition, history tables can quickly become huge if thoughtful decisions are not made about what is really significant. Again, if the application that produces the data has implemented history concepts, the users of the application are likely to have determined which data elements are really essential to preserve. This section summarizes some ways of representing history using the person table from Exhibit 1.
Perhaps the simplest kind of history table takes a transaction view of change, recoding change in one data element at a time. The virtue of this approach is that all changes affecting one person or one data element can be seen directly. One drawback to this approach is that the simultaneous change of two or more data elements at a same time is not represented directly and can be cumbersome to deal with. For example, it would take many rows to show the fact that a new employee was hired.
NAME CHANGE PREVIOUS CURRENT CHANGE-DATE SMITH PHONE 303 492-2473 303 492-9473 20DEC89
A second approach takes a snapshot of all the elements in a table when any one of a specific group of elements has changed. Start and stop dates show when change has occurred. The virtue of this approach is that all the other data elements in the row (whether they changed or not) give a context for those that did change. Another benefit of this approach is that the "history version" of the table looks a lot like the corresponding "regular version" of the table. Extra data elements or fairly complex retrieval code is necessary to pinpoint exactly which data elements have changed.
NAME CAMPUS DEPT PHONE CURRENT START STOP SMITH CENTRAL IRM 303 492-9473 Y 01JUL93 SMITH CENTRAL UMS 303 492-9473 N 20DEC89 01JUL93 SMITH CENTRAL UMS 303 492-2473 N 20OCT88 20DEC89
A third approach uses one row to represent a fixed time period (e.g., year) with multiple data elements representing the characteristics of the entity "as of" that period in time. The virtue of this approach is that it can make queries that do counting much easier than the other approaches (e.g., "how many X's were there in 1993?" or "how many years did the X's have Y characteristic?). Change that occurs entirely within the fixed time perioid can be difficult with this approach.
NAME CAMPUS DEPT PHONE YEAR SMITH CENTRAL IRM 303 492-9473 1994 SMITH CENTRAL IRM 303 492-9473 1993 SMITH CENTRAL IRM 303 492-9473 1992 SMITH CENTRAL UMS 303 492-9473 1991 SMITH CENTRAL UMS 303 492-9473 1990A fourth approach uses one data element represent a particular characteristic in each time period. Given a specific domain of questions, this approach can make certain queries extremely easy and very useful. It requires a fully defined view of the problem and a careful specification of the data.
NAME CAMPUS DEPT-IN-94 DEPT-IN-93 DEPT-IN-92 SMITH CENTRAL IRM IRM UMSA final note on history is that it complicates the use of foreign keys. For example, if job class is an element in an appointment history table, care must be taken to understand the meaning and implications of changes in the job class table's data elements. Obviously, it causes problems to drop a row in the job class table.
Table names. Intuitive table names (and good table descriptions that expand on those names) are a very important pre-requisite for understanding and then naming entity attributes or columns. Table names are the most complicated and costly names to change in the data warehouse, so great care should be taken in naming tables correctly in the design phase. Although the "Warehouse Naming Guidelines" focuses more on column naming than table naming issues, good naming practices in general are discussed in that document.
Correspondence between Oracle tables and mainframe files. The normalization example in Exhibits 1 and 2 suggest how, despite many differences in design assumptions, tables in the data warehouse can roughly correspond to records or record segments used in the mainframe application system. From this perspective the development of data extracts on the mainframe is a process of mapping mainframe data elements onto a new structure in the data warehouse. However, the warehouse goals of integrating data over time and across subject areas usually require that the tables in the warehouse be structured differently from the data in mainframe applications.
Table keys. The basic logical relationships between tables has to be worked out during the E-R diagramming process. Table keys that are fixed in the data warehouse physical design and the mainframe data extracts are based on the logical relationships that are established during the E-R diagramming process. The following rules apply to the design of keys:
The key of an Oracle table is similar to the key of a mainframe file in that an Oracle key identifies a unique record. The key of a mainframe file, therefore, should never be discarded.
When there is a fairly direct mapping from the mainframe application to the data warehouse tables, the key to a parent segment on the mainframe side will match the table key on the Oracle side. (When a data warehouse table contains historical data, the mainframe key may be augmented with a date to distinguish records extracted, for example, on different dates.)
To link the parent segment with the child segments, the key to the parent segment on the mainframe side becomes a part of the key for the child table on the Oracle side. If there is no data element in the child segment that will serve to uniquely distinguish the segments, the logical key to the child table is formed by concatenating the parent key with an arbitrary sequence number. (The name of an element that is used to augment the key in this way should end with "_SEQ_NUM" to make it clear that on the mainframe it had a somewhat arbitrary meaning.)
Rows. The data warehouse tables, the mainframe data extracts, and the documentation provided to data warehouse users should all agree on specific rules that state the conditions under which a mainframe record or segment should appear as a row in the data warehouse (and the conditions under which a record or segment would be dropped). The data extract and loading process must be explicit as to whether new rows are inserted or replaced. Since these rules are specific to each subject area, they are not discussed in these guidelines. However, these rules should serve to meet the fundamental objective of the data warehouse: permitting users to "obtain facts about the university's business." For example, "test cases" may be necessary in a production environment, but do not belong in a data warehouse. Just as blank records (e.g., all data elements are zero or blank) on the mainframe should not be transported to the data warehouse, records or segments that are inactive, marked for deletion, or logically null for some other reason (e.g., not within effective dates) should not appear in the data warehouse.
Referential integrity. Referential integrity, the requirement that data values designed to reference row keys in other tables must indeed reference rows that exist, can be implemented via the application (procedural referential integrity) or the RDBMS (declarative referential integrity). Declarative integrity prevents any screen or query from entering, changing or deleteing data that violates integrity constraints. Procedural integrity puts the burden on the screens or queries to prevent violation of the rules. For example, the requirement that every value of JOB_CLASS_CODE (e.g., in an appointment table) must be found in the job class table could be enforced procedurally or declaratively. The data warehouse is updated in batch mode, usually a complete row at a time. Since those who are loading data into the data warehouse are otherwise not involved with the data subject area, and do not like getting messages about violations of integrity constraints, the responsibility for referential integrity must lie first with the mainframe application system that collects and manages the data and second with the data extract process. The data extract process should check on any referential integrity constraints (and produce reports that notify the system manager), since such problems are fairly difficult for warehouse users to deal with. The design of the extract process might consider adding rows to some tables to make it explicit that, for example, the key's value exists in the database but a label is not available for a particular code.
Another way of solving referential integrity problems is to denormalize a table as part of the extract process. This approach can have a positive performance impact as well.