by Richard K. Roth, Principal, Price Waterhouse LLP
Eric L. Denna, Ph.D., Warnick/Deloitte & Touche Faculty Fellow, Brigham Young University, Marriott School of Management
This is reprinted from the February 1996 issue of Manufacturing Systems (C) Chilton Publications, All Rights Reserved. [More about this topic can be read in Balancing Act: A Practical Approach to Business Event Based Insights, Chapter 16, Data Warehousing.]
Introduction
The concept of Decision Support Systems (DSS) driven by data warehouses is a powerful one in our “data rich but information poor” world. In any given situation, a wide range of computer hardware and software technologies could potentially serve as all or part of a solution. To avoid solutions that are unnecessarily costly, or solutions that simply do not work, guiding principles for allocating functions to computing-platform technology are important.
The wide range of architectural issues that must be addressed makes it easy to focus on some areas to the exclusion of critical considerations. This article presents a framework for defining DSS/data-warehouse requirements, with corresponding principles for making computing-platform assignments, that helps avoid the pitfalls. Some recommendations for getting started without getting derailed are also provided.
The Framework and the Problem
Decision Support Systems (DSS) can be thought of as having four basic components:
- Data Capture
- Data Storage and Retrieval Engine (data warehouse)
- Data Analysis
- Presentation Layer
The presentation layer very often gets the most attention at the outset of a DSS discussion or project. It is the easiest to touch and feel and provides the easiest visualization of how problems could be posed and end results communicated. However, the majority of the business purposes for a DSS are addressed by the first three components. And the majority of the cost (85% or more in Bill Inmon’s experience[1]) is incurred in storing and accessing the detail data that drive the DSS applications — component #2. An effective presentation layer is important, because DSS accessibility by business analysts without the routine intervention of technical personnel makes a DSS more useful for a broader range of problems. But, for problems big enough to be worth solving, even the most cumbersome presentation layer is better than no solution at all.
In an extremely simple case, one piece of software (e.g., spreadsheet) running on one hardware platform (e.g., PC) could serve the requirements for all four components. However, most real-world problems require an integrated solution consisting of several hardware and software technologies. Choosing the right technologies depends on the nature of the problems being addressed. And the problems vary along the lines of the four basic DSS components described above. It is important to consider the requirements for each component separately, because lumping them under a single heading makes it difficult to choose between alternatives.
Data Capture
DSS applications may use data from a variety of internal or external sources. In most cases, the primary, high-volume data will come from the operational systems used to run the business on a day-to-day basis. The detailed data at the bottom of operational systems (captured at great cost and then archived, by design) are the primary raw materials for understanding the patterns of interactions between suppliers, employees, business processes, customers, markets, products, services, pricing decisions, and financing alternatives.
The architecture of operational systems makes them good sources of data to drive DSS applications, but the same architecture makes them terrible models for DSS components 2-4. Operational systems tend to be functionally oriented and optimized for “getting today’s work off the table.” Operational data tends to be very detailed, dealing with individual transactions or series of related transactions that have a limited life cycle from an operational standpoint. Management information system components of operational systems tend to support the limited-life-cycle clerical and management aspects of their respective narrow functional areas. But they do not support a “retrospective” subject-area or enterprise view of transaction processing over time, which is the general goal for DSS.
Data in operational systems is usually spread over multiple technology platforms (including paper files), and is maintained according to different update cycles and classification structures. Part of developing a DSS includes rationalizing or converting the operational data so that it is comparable across sources where different aspects of the same or related underlying transactions are being represented. Regardless of the technical platform or architecture for the DSS, the task of rationalizing data from various sources is the same, and it should not be underestimated.
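As a simple illustration of the rationalization task, the following Python sketch converts extracts from two hypothetical source systems into a single comparable record layout. The field names, date formats, and region codes are invented for illustration only.

```python
# Minimal sketch of rationalizing detail from two hypothetical operational
# extracts into one comparable record layout. All field names, codes, and
# formats are illustrative assumptions, not taken from any particular system.
from datetime import datetime

# Extract from an "order entry" system: U.S.-style dates, numeric region codes
order_entry_rows = [
    {"order_dt": "02/14/95", "region": "01", "amount": "1250.00"},
    {"order_dt": "03/02/95", "region": "02", "amount": "830.50"},
]

# Extract from a "billing" system: ISO dates, mnemonic region codes, amounts in cents
billing_rows = [
    {"bill_date": "1995-02-20", "rgn": "NE", "amt_cents": "125000"},
]

REGION_MAP = {"01": "NE", "02": "SW"}  # reconcile the two classification structures

def from_order_entry(row):
    return {
        "event_date": datetime.strptime(row["order_dt"], "%m/%d/%y").date(),
        "region": REGION_MAP[row["region"]],
        "amount": float(row["amount"]),
        "source": "order_entry",
    }

def from_billing(row):
    return {
        "event_date": datetime.strptime(row["bill_date"], "%Y-%m-%d").date(),
        "region": row["rgn"],
        "amount": int(row["amt_cents"]) / 100.0,
        "source": "billing",
    }

# A single, comparable detail layout regardless of source
warehouse_detail = [from_order_entry(r) for r in order_entry_rows] + \
                   [from_billing(r) for r in billing_rows]

for rec in warehouse_detail:
    print(rec)
```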
It usually will be necessary to capture some new data to support the goals of a DSS initiative. It is important to remember that capturing new information is inherently an operational problem, not a DSS issue per se. As such, it should be treated as a problem for integration with existing operational systems, if practical or logical, or as a new operational system. System capabilities to fulfill new data requirements should be designed using an operational application architecture appropriate for the business processes from which the new information will be captured. A common mistake is to mix new data requirements that surface as part of defining a DSS environment with the requirements for components 2-4, which tends to bias (confuse) hardware and software platform choices.
Data Storage and Retrieval Engine (data warehouse)
Following Bill Inmon’s model for a data warehouse[2], data is maintained in four major classes:
- Current detail
- Old detail
- Lightly summarized
- Highly summarized
Current and old detail correspond most closely to the detailed transaction-level data from the operational systems discussed above. Lightly summarized data amounts to reductions of the detail appropriate for the general requirements of a particular department’s DSS applications. Highly summarized data amounts to further reductions of the lightly summarized data for specific DSS modeling requirements. The primary difference among the four classes of data is VOLUME. And volume is the primary (although not the only) factor that determines the appropriate technology platform for storing and retrieving data in a warehouse.
Classes 1 and 2 represent the highest data volumes and the richest source of information.[3] Since the volume of detail data is the primary cost driver for data warehouses, the largest impact on the cost (or cost avoidance) of a DSS environment can be achieved by making the right technology platform choices in this area.
Current detail usually is thought of as roughly two years of history maintained on magnetic disk devices. Old detail is roughly assumed to be up to 10 years of history maintained on tape or other low-cost storage devices. Depending on the scope of the warehouse, the nature of the business, and the size of the enterprise, annual additions to enterprise-wide detail data should be expected to run in the multi-million to multi-billion record range. The storage space required can range from hundreds of gigabytes to terabytes.
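To make these volumes concrete, a back-of-envelope calculation along the following lines can be run for any enterprise. Every figure in this Python sketch is a hypothetical assumption, not a benchmark.

```python
# Back-of-envelope sizing under purely hypothetical assumptions; every figure
# below is an illustrative guess, not a measurement.
transactions_per_year = 200_000_000   # assumed enterprise-wide detail records added annually
bytes_per_record = 400                # assumed average record width, including keys

annual_detail_bytes = transactions_per_year * bytes_per_record
current_detail_bytes = 2 * annual_detail_bytes    # roughly two years kept as current detail
old_detail_bytes = 10 * annual_detail_bytes       # up to ten years kept as old detail

GB = 10**9
print(f"Annual additions: {annual_detail_bytes / GB:,.0f} GB")
print(f"Current detail:   {current_detail_bytes / GB:,.0f} GB")
print(f"Old detail:       {old_detail_bytes / GB:,.0f} GB")
```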
Lightly summarized and highly summarized data can vary greatly in degree of volume reduction from the corresponding detail. Some DSS applications require routine reprocessing of large subsets of the detail (customer scoring applications would be an example). Probably most applications, however, can be satisfied with a very small subset of the data summarized at one or two orders of magnitude of reduction (modeling financial performance for individual products or categories of products over time would be an example). Because of the potential volume reductions, the platform options open up substantially when dealing with lightly and highly summarized data.
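The following Python sketch illustrates the kind of volume reduction involved, using randomly generated stand-ins for detail rows and illustrative grouping levels: product/month for the lightly summarized view and category/quarter for the highly summarized view.

```python
# Sketch of the reduction from detail to lightly and highly summarized data.
# The detail rows are randomly generated stand-ins for transaction-level
# history; the grouping levels are illustrative choices.
import random
from collections import defaultdict

random.seed(1)
products = [("P%03d" % i, "CAT%d" % (i % 5)) for i in range(200)]
product_category = dict(products)

# "Current detail": one row per sale
detail = []
for _ in range(100_000):
    product, category = random.choice(products)
    detail.append({"product": product, "category": category,
                   "month": random.randint(1, 12),
                   "amount": round(random.uniform(5, 500), 2)})

# Lightly summarized: total sales by product and month (a departmental view)
light = defaultdict(float)
for row in detail:
    light[(row["product"], row["month"])] += row["amount"]

# Highly summarized: total sales by category and quarter (a specific modeling view)
high = defaultdict(float)
for (product, month), amount in light.items():
    high[(product_category[product], (month - 1) // 3 + 1)] += amount

print(len(detail), "detail rows")              # 100,000
print(len(light), "lightly summarized rows")   # about 2,400
print(len(high), "highly summarized rows")     # 20
```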
For any large volume of data (large could be as small as a few hundred thousand rows or as large as a few billion rows), the nature of the processing required, in combination with the absolute number of rows/records, is the driver for platform choice. Three basic hardware architectures and two basic file organizations are available:
1. Hardware architectures
A. Mainframe — Offers very high data bandwidth (4.6-17 megabytes per second on each of up to 256 channels per machine) in a shared memory, parallel processing environment. Excellent architecture for situations where large amounts of data on disk or tape must be evaluated for selection or re-sequencing purposes. Excellent environment for situations where enterprise-wide access is important. Drawbacks are: 1) the relatively high cost of disk storage (presently about 4:1 — but coming down quickly — compared with server technology) and processing power (MIPS); and 2) the relatively thin market for “friendly” presentation layer tool choices. This environment is often referred to as “data rich but MIPS poor”.
B. Server — Offers cheap processing power (MIPS) and memory, but relatively low data bandwidth (0.1-1.25 megabytes per second with only one channel per machine)[4] in a single processor environment. Excellent for situations where all data can be loaded into memory before processing (e.g., personal applications like spreadsheets, word processing and graphics) or where relatively small amounts of data must be accessed for selection or re-sequencing purposes. The cheap processing power and memory make it cost-effective to consume these resources in the long instruction path lengths necessary for implementing “friendly” graphical user interfaces. As a result, the market is relatively rich in “friendly” presentation layer tool choices. This environment is often referred to as “MIPS rich but data poor.”
C. Symmetric Multi-Processors (SMP) — Uses cheap server technology running in a parallel processing environment (currently 4-20 processors per machine) to scale up the power of server technology described above. Excellent environment for accessing large stores of data on magnetic disks through indices where instruction path lengths are long but the amount of data that must be moved from the disks to the processors for selection or re-sequencing purposes is relatively small.
2. File organizations
A. Sequential — Primary method for storing old detail in operational systems and warehouses. Also the primary file structure used for transferring data from various platforms to the data warehouse and between subsystems of a DSS environment. Depending on the configuration, a mainframe can read sequential files at a rate of 1 to 20 or more gigabytes per minute; server technology can read sequential files at rates of .002-.1 gigabytes per minute, a difference of 1 to 4 orders of magnitude, depending on specific machine architectures and configurations. Excellent method for storing data warehouse information where many data elements can serve as the primary selection criteria and no practical amount of indexing would materially reduce the need for sequential table scans if the data were stored in a relational file organization.
B. Relational — Primary method for storing operational data to support on-line transaction processing. Also the preferred method for storing lightly summarized and, especially, highly summarized data in a warehouse application where usage tends to make one or a few data elements the primary selection criteria. The primary advantage of this file organization is that relational data can be accessed through SQL, which is the primary method through which most user-“friendly” tools communicate with data warehouses. The disadvantages are that random access to data is very slow (on the order of 6-12 megabytes per minute on mainframe or server technology) and that the instruction path length for resolving SQL instructions is very long, which consumes substantial processing power (MIPS). As a result of the long SQL instruction path lengths, data warehouse applications where relational organizations are appropriate tend to be implemented on server or SMP technology, where cheap MIPS are available. A small-scale sketch contrasting the two file organizations follows this list.
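The following small-scale Python sketch contrasts the two organizations: a sequential file that must be scanned end to end because any field can drive selection, and a summarized relational table reached through SQL and an index. File, table, and column names are invented for illustration, and the sketch writes a small temporary data file.

```python
# Small-scale sketch contrasting the two file organizations. File, table, and
# column names are invented for illustration.
import csv
import sqlite3

# --- Sequential organization: scan every record; any field can drive selection ---
with open("detail.dat", "w", newline="") as f:
    writer = csv.writer(f)
    for i in range(50_000):
        writer.writerow([i, "P%03d" % (i % 200), i % 12 + 1, (i % 997) / 10.0])

total = 0.0
with open("detail.dat", newline="") as f:
    for rec_id, product, month, amount in csv.reader(f):
        if month == "6" or product == "P017":   # criteria on any data element
            total += float(amount)
print("sequential scan result:", round(total, 2))

# --- Relational organization: summarized data reached through SQL and an index ---
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales_summary (product TEXT, month INTEGER, amount REAL)")
con.executemany(
    "INSERT INTO sales_summary VALUES (?, ?, ?)",
    [("P%03d" % (i % 200), i % 12 + 1, float(i % 997)) for i in range(2_400)],
)
con.execute("CREATE INDEX ix_product ON sales_summary (product)")
row = con.execute(
    "SELECT SUM(amount) FROM sales_summary WHERE product = ?", ("P017",)
).fetchone()
print("indexed SQL result:", row[0])
```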
In summary, platform choice for the storage and retrieval engine depends on how much data there is and what will be done with it. Current and old detail tends to be high in volume and used to feed specialized departmental and individual warehouses. Detail data often will be accessed sequentially, because many elements can be used as the primary selection criteria. Therefore, mainframes tend to be good platforms for the detail portions of data warehouses. Departmental and individual warehouses tend to be lower-volume reduced answer sets of the detail data. These warehouses often are created to support a particular focus of analysis and, consequently, tend to be good candidates for relational storage and access through indices. Consequently, server/SMP technologies can be good platforms for the departmental and individual portions of data warehouses.
Data Analysis
Once a particular DSS problem is formulated, one or more analytical models may be appropriate. The most common DSS problem requires selecting an answer set from the warehouse that meets specified criteria. Then, according to other criteria, the preliminary answer set is augmented, transformed, re-sequenced, summarized or counted, formatted, and displayed as the result. The result is usually a cross tabulation, and often a single line. Basically, this kind of analytical framework is known as data reduction or, more simply, reporting. The storage and retrieval component of the data warehouse always has a reporting framework of some competency as an integral part of its functionality. However, any particular DSS problem may require data reduction/reporting functions that are not available directly from the storage and retrieval engine.
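A minimal Python sketch of the data-reduction pattern follows: select an answer set, summarize it, and format a cross tabulation. The answer set is an invented in-memory stand-in for rows returned by the storage and retrieval engine.

```python
# Minimal sketch of the data-reduction (reporting) pattern: select an answer
# set, summarize it, and format a cross tabulation. The rows are hypothetical
# stand-ins for output from the warehouse engine.
from collections import defaultdict

answer_set = [
    {"region": "NE", "quarter": 1, "amount": 120.0},
    {"region": "NE", "quarter": 2, "amount": 180.0},
    {"region": "SW", "quarter": 1, "amount": 95.0},
    {"region": "SW", "quarter": 2, "amount": 140.0},
]

# Select and summarize: total amount by region and quarter
cells = defaultdict(float)
for row in answer_set:
    if row["amount"] > 0:                       # selection criteria
        cells[(row["region"], row["quarter"])] += row["amount"]

# Format as a cross tabulation: regions down the side, quarters across the top
regions = sorted({r for r, _ in cells})
quarters = sorted({q for _, q in cells})
print("region  " + "".join("Q%-8d" % q for q in quarters))
for r in regions:
    print("%-7s " % r + "".join("%-9.1f" % cells[(r, q)] for q in quarters))
```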
Separate reporting and data analysis facilities may be necessary to accomplish particular objectives. More advanced data analysis requirements include statistical and econometric models that employ regression, factor analysis, or other trend, pattern recognition, and exception analysis techniques. For small amounts of numerical data, predictive models can be implemented using workstation-based neural network technology. For large amounts of data with complicated business rules, the mathematics of fractal geometry can be employed to deal with data compression issues and so facilitate storage and processing, usually on a mainframe. Using these kinds of techniques almost always requires transferring an answer set provided by the storage and retrieval engine’s reporting mechanism to a separate software package (which may be on a different platform).
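As a minimal example of this kind of follow-on analysis, the Python sketch below fits a least-squares trend line to a small answer set handed off by the storage and retrieval engine. The monthly figures are invented for illustration.

```python
# Minimal sketch of follow-on statistical analysis: fitting a least-squares
# trend line to a small answer set transferred out of the warehouse.
# The monthly figures are invented for illustration.
monthly_sales = [102.0, 108.5, 111.2, 118.9, 121.4, 130.7]   # hypothetical answer set
months = list(range(1, len(monthly_sales) + 1))

n = len(months)
mean_x = sum(months) / n
mean_y = sum(monthly_sales) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, monthly_sales)) / \
        sum((x - mean_x) ** 2 for x in months)
intercept = mean_y - slope * mean_x

print("trend: sales = %.2f + %.2f * month" % (intercept, slope))
print("projected month 9: %.1f" % (intercept + slope * 9))
```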
It is tempting to bundle the storage and retrieval engine with the data analysis component if DSS is viewed as primarily a reporting problem. However, thinking about the data analysis problem in a broader context clarifies the difference between the two components and demonstrates the importance of separately defining DSS requirements for each area.
Presentation Layer
The presentation layer is the mechanism through which a DSS user supplies the criteria, parameters and instructions that the DSS components use in their processing. Also, the presentation layer is the delivery mechanism for results produced by the DSS programs.
The image that comes to mind of a DSS presentation layer includes a powerful workstation configured with a high-resolution color monitor, running a graphical user interface (GUI) that displays charts and numbers that can be traversed using point-and-click drill-down features. Ideally, the presentation layer would make use easy for non-technical people and would not require that they have a detailed understanding of the underlying data in the warehouse. The user interface would be intuitive enough that it could be learned and used without instruction manuals. Further, it would provide fully integrated and seamlessly automated control over all DSS components, from access to data captured by operational systems through delivery of query results. Complete user control over DSS facilities from the presentation layer would mean that technical MIS personnel would not need to participate in operation of the system.
Notions of the “ideal” notwithstanding, it is important to remember that the result of a DSS query may be one or a few numbers. It may be a file destined for automated use in some other application. Frequently, the output will be a lot of numbers that are more useful if printed in rows and columns on paper. Where non-technical users are concerned, they are more resourceful than they generally are given credit for. People with real problems to solve can deal with an interface more complicated than one that just anyone could use.
The reality is that very few things are totally integrated and automated; the current state of the art in DSS presentation layers in particular, and in information technology in general, is a good example. One reason is that increasing levels of integration and automation lead to increasing complexity and cost that simply are not worth it in many cases. Another reason is that DSS presentation layer technology is relatively new. We are a long way from the day when users who do not know what to ask for can “just get what they want” through a holistic interface. And MIS people will be in the middle of non-trivial DSS applications because the hardware and software components that make up the systems will remain a complicated specialty for the foreseeable future.
The primary reason for building a DSS is to provide a mechanism for understanding the patterns buried in otherwise inaccessible data. The objective is not to use client/server technology, give analysts a GUI, use SQL everywhere or automate and integrate everything in sight. Operational systems are capturing huge amounts of information at great cost that is potentially valuable but not being used. DSS applications that exploit that information can be developed relatively quickly as long as tools available are applied with common sense in proportion to the nature of the problems being addressed.
Presentation layer characteristics are probably the most diverting technical distractions that stop DSS solutions from getting implemented, or even started. Presentation layer desires often lead to platform decisions that shoehorn DSS functionality into technology not fit for the purpose — the computer equivalent of building skyscrapers out of tinker toys.
Requirements for the presentation layer should be defined after addressing the first three DSS components described above. This minimizes the tendency to get distracted from the primary objective and the temptation to allocate functionality to sub-optimal platform technologies.
Getting Started
Most executives are inherently comfortable with the notion of a data warehouse and DSS. It seems logical that all the money spent on capturing detail transaction-level data ought to pay off for something more than filling orders, collecting receivables and getting the bills paid. The idea that there is something to be learned from a thoughtful analysis of all that history is a natural one.
The first hurdle usually comes when trying to cost-justify the first DSS application. The history of MIS in general is one of great expectations, high costs, persistent problems, late delivery, and mediocre results. The comfortable idea of a DSS sometimes becomes an anxious thought once a specific proposal gets formulated.
Anxiety notwithstanding, many companies have found that this is an idea that actually works. It is almost certain that some fundamental assumptions about customers, suppliers, markets, etc. will not be supported by the historical record. DSS applications can be the catalyst for identifying and justifying fundamental changes in strategy or tactics: product services organizations have fundamentally changed their maintenance agreement packages and pricing; manufacturers have identified subtle causes of out-of-bounds product failure rates; direct mail operations have quit sending promotions to unresponsive customer segments; financial institutions have developed radically more lenient credit exception policies; and retailers have been able to validate (or not) their view of fashion trends. A DSS usually proves worth doing if the development pitfalls are avoided.
To get over the first hurdle and avoid the pitfalls, we recommend the following:
1. Pick a problem that is worth solving for which virtually all of the data necessary to drive the data analysis is available from one or a small number of existing operational systems. Do not burden the DSS project with a non-trivial operational system re-engineering project.
2. Prove the value of doing retrospective data analysis for the problem on the table. The data analysis component is where the value is delivered. The other three DSS components represent infrastructure that makes it possible to roll out DSS on a larger scale. Do not worry about metadata for source systems or the warehouse. Do not worry about the long-run architecture for the enterprise-wide detail warehouse. Do not worry about presentation-layer standards. View this problem as an individual or small departmental data warehouse case. Get a hard-dollar cost/benefit success for DSS that will relieve some anxiety in downstream cost-justification proposals.
3. Define the requirements and allocate functionality to hardware and software platforms according to the principles outlined above for the four major DSS components. It probably will not be possible to satisfy the requirements on one PC and a spreadsheet, but keep the number of platform technologies to a minimum consistent with the principles.
4. Constrain the requirements by the existing hardware and software portfolio in place where possible. Do not take on new learning curves unnecessarily. Avoid getting distracted by technology at all costs.
[1] The Computer Conference Analysis Newsletter, May 18, 1993, n. 319, p. 12.
[2] Inmon, W.H., Building the Data Warehouse, New York: John Wiley & Sons, 1993, p. 33.
[3] Roth, R.K., and Denna, E.L., “A Summary of Event-Driven System Concepts,” Price Waterhouse LLP white paper, 1994.
[4] Theoretically, disk I/O bandwidth could be 8-20 megabytes per second, depending on bus architecture; however, achievable throughput is dramatically lower, as indicated.