Introduction
Data warehousing is one of the most significant recent advances in the field of information systems (IS) and has become an effective means of turning data into useful information (Kimball, 2002, p. 25). It is an advanced way of combining data from several, often very large, distributed, heterogeneous databases and other information sources. The concept of data warehousing is used mostly to build decision support systems (DSS). According to Inmon (1992), data warehousing is a collection of decision support technologies aimed at enabling the knowledge worker (executive, manager, or analyst) to make better and faster decisions.
Many companies are committing substantial human, technical, and financial resources to developing and using data warehouses. In construction project management, for example, a data warehouse can enable decision-makers, such as project managers, to use project and market information to meet client needs (Ahmad et al., 1999). The major reason for these efforts is to give easy access to specially organized data that can be used with decision-support applications such as management reporting, queries, and decision support systems (Kimball, 2002, p. 4).
According to Poe et al. (1998), data warehousing includes extracting data from operational systems and other external sources; cleansing, scrubbing, and preparing data for decision support; maintaining data in appropriate data stores; accessing and analyzing data using a variety of end-user tools; and mining data for significant relationships. Data warehousing technologies have been employed successfully in many industries.
Examples include manufacturing (order shipment and customer support), retail (user summary and inventory management), financial services (claims analysis, risk analysis, credit card analysis, and fraud detection), transportation (fleet management), telecommunications (call analysis and fraud detection), utilities (power usage analysis), and healthcare (outcomes analysis) (Adriaans and Zantinge, 1996, p. 154).
History
Data warehouses were developed during the late 1980s and early 1990s to meet a growing need for management information and analysis that could not be fulfilled by operational systems. According to Theodoratos et al. (1999), operational systems could not meet this need for the following reasons:
- The processing load of reporting slowed the response time of the operational systems;
- The database designs of operational systems were not optimized for information analysis and reporting;
- Most organizations had more than one operational system, so company-wide reporting could not be supported from a single system; and
- Developing reports in operational systems often required writing specific computer programs, which was slow and expensive.
Consequently, separate databases designed specifically to support management information and analysis began to be built (Ahmad et al., 1999). It then became possible to bring in data from various data sources and combine this information in one place (Inmon, 1992). This capability, together with user-friendly reporting tools and independence from the operational databases, gave rise to the data warehouse.
The Data Warehouse Architecture
Lukauskis (1999) defines data warehouse architecture as a “description of the elements and services of the warehouse, with features showing how the components will fit together and how the system will grow over time”. There is always an architecture, either ad hoc or planned, but experience shows that planned architectures have a better chance of succeeding (Anahory and Murray, 1997).
The data warehouse architecture consists of various interconnected elements which are:
- Operational and external database layer: the source data for the data warehouse.
- Informational access layer: the tools that end users employ to extract and analyze the data.
- Data access layer: the interface between the operational and informational access layers.
- Metadata layer: the data directory or repository of metadata information (Inmon, 1992).
Concept of data warehousing
The concept of data warehousing originated in the early 1980s and was aimed at providing an architectural model for the flow of data from operational systems to decision support environments. It was intended to address the various problems associated with this flow and the high costs that came with it. Without such an architecture, there was a vast amount of redundancy in the delivery of management information (Corey et al., 1998). In large corporations it was typical for multiple decision support projects to operate independently, each serving different users but often requiring much of the same data (Poe et al., 1998).
The process of gathering and combining data from various sources, often legacy systems, was typically repeated for each project. Moreover, legacy systems were regularly revisited as new requirements surfaced, each requiring a subtly different view of the legacy data (Kimball, 2002).
By analogy with real-life warehouses, data warehouses were meant to be a comprehensive collection, storage, and staging area for company data (Lukauskis, 1999). From there, data could be distributed to retail stores, or data marts, which were tailored for access by decision support users (Ahmad et al., 1999). Whereas the data warehouse was intended to handle the bulk supply of data from operational systems and the organization and storage of this data, the data mart could focus on packaging and presenting selections of the data to end users to meet specific management information requirements (Anahory and Murray, 1997).
Over time this analogy and the architectural vision were lost, as some vendors and industry speakers redefined the data warehouse as simply a management reporting database. This is a subtle but significant departure from the original vision of the data warehouse as the center of a management information architecture, in which the decision support systems were effectively the data marts or retail stores.
Storage
In online transaction processing (OLTP) systems, relational database design uses the discipline of data modeling and generally follows the Codd rules of data normalization in order to ensure data integrity (Anahory and Murray, 1997). Information is broken down into simple structures in which all of the individual atomic-level elements relate to each other and satisfy the normalization rules. Normalization is commonly described in terms of five increasingly strict normal forms, and an OLTP design usually achieves third normal form (Lukauskis, 1999).
Fully normalized OLTP database designs result in the information from a business transaction being stored in several tables. Relational database managers are efficient at managing the relationships between tables, and insert/update performance is very fast because only a small amount of data is affected in each relational transaction (Adriaans and Zantinge, 1996).
OLTP databases are efficient because they typically deal only with the information around a single transaction. In reporting and analysis, by contrast, thousands of transactions may need to be reassembled, imposing a massive workload on the relational database (Corey et al., 1998). Given enough time the software can usually return the requested results, but data warehousing experts recommend that reporting databases be physically separated from the OLTP database because of the negative performance impact on the machine and all of the applications it hosts. Additionally, data warehousing practice recommends that data be restructured and reformatted to facilitate query and analysis by novice users (Anahory and Murray, 1997).
OLTP systems are designed to provide good performance for rigidly defined applications built by programmers who are familiar with the constraints and conventions of the technology. Finally, the data warehouse must support high volumes of data gathered over extended periods of time, which are subject to complex queries and must accommodate formats and definitions inherited from independently designed package and legacy systems (Lukauskis, 1999).
Designing the data warehouse's data architecture is the work of data warehouse architects. The objective of a data warehouse is to bring together data from different existing databases to support management and reporting needs (Lukauskis, 1999). The most widely accepted principle is that data should be stored at its most elemental level, since this provides the most useful and flexible basis for reporting and information analysis (Poe et al., 1998). Nonetheless, because projects focus on different specific requirements, there are alternative methods for designing and implementing data warehouses.
There are two leading approaches to organizing the data in a data warehouse: the dimensional approach advocated by Ralph Kimball and the normalized approach advocated by Bill Inmon. Although the dimensional method is very useful in data mart design, it can result in a rat’s nest of long-term data integration and abstraction difficulties when used in a data warehouse (Kimball, 2002). In the dimensional approach, transaction data is partitioned into measured facts (which are generally numeric data that capture specific values) and dimensions (which contain the reference information that gives each transaction its context).
For example, a sales transaction can be broken up into facts, such as the number of goods ordered and the price paid, and dimensions, such as the date, client, product, geographical location, and salesperson. The chief advantage of the dimensional approach is that the data warehouse is easy for company personnel with limited IT experience to understand and use (Anahory and Murray, 1997). Furthermore, the data warehouse tends to operate very quickly because the data is pre-joined into the dimensional structure. The key disadvantage of this approach is that it is rather difficult to add to or modify the warehouse structure later, when the organization changes the way it does business.
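The split between facts and dimensions can be illustrated with a small sketch. It assumes an in-memory SQLite database, and every table name, column name, and data row (fact_sales, dim_product, and so on) is hypothetical and chosen only to mirror the sales example; it is not a schema taken from the cited authors.

```python
import sqlite3

# Hypothetical star schema: table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, calendar_date TEXT);
CREATE TABLE dim_client  (client_key INTEGER PRIMARY KEY, client_name TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    client_key  INTEGER REFERENCES dim_client(client_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity    INTEGER,  -- fact: number of goods ordered
    amount_paid REAL      -- fact: value paid
);
""")
conn.executemany("INSERT INTO dim_date VALUES (?, ?)",
                 [(1, "1999-03-01"), (2, "1999-03-02")])
conn.executemany("INSERT INTO dim_client VALUES (?, ?)",
                 [(1, "Acme"), (2, "Birch")])
conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                 [(1, "Bolt"), (2, "Nut")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 1, 10, 20.0),
    (1, 2, 1, 5, 10.0),
    (2, 1, 2, 4, 2.0),
])


def sales_by_product(connection):
    """Total quantity and amount per product: one join from fact to dimension."""
    return connection.execute("""
        SELECT p.product_name, SUM(f.quantity), SUM(f.amount_paid)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        GROUP BY p.product_name
        ORDER BY p.product_name
    """).fetchall()


print(sales_by_product(conn))
```

In this layout the numeric measures live only in the fact table, and each report is a join from that table to whichever dimensions give the figures their context.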
The normalized approach uses database normalization, and data in the data warehouse is stored in third normal form. Tables are then grouped together by subject areas that reflect the general definition of the data (client, product, finance, etc.) (Ahmad et al., 1999). The main benefit of this approach is that it is fairly simple to add new information to the database, whereas its primary shortcoming is that producing information and reports can be somewhat time-consuming because of the number of tables involved (Inmon, 1992). In addition, since the separation of facts and dimensions is not explicit in this type of data model, it is difficult for users to join the required data elements into meaningful information without a precise understanding of the data structure.
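The reporting cost of the normalized approach can be sketched in the same way. The example below again assumes an in-memory SQLite database with hypothetical tables (client, product, sales_order, order_line) and made-up rows; it only illustrates that a simple revenue report has to traverse several tables.

```python
import sqlite3

# Hypothetical third-normal-form layout: each fact is stored once, so a
# report has to walk several tables. All names here are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE client  (client_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT, unit_price REAL);
CREATE TABLE sales_order (
    order_id   INTEGER PRIMARY KEY,
    client_id  INTEGER REFERENCES client(client_id),
    order_date TEXT
);
CREATE TABLE order_line (
    order_id   INTEGER REFERENCES sales_order(order_id),
    product_id INTEGER REFERENCES product(product_id),
    quantity   INTEGER
);
""")
conn.executemany("INSERT INTO client VALUES (?, ?)", [(1, "Acme"), (2, "Birch")])
conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                 [(1, "Bolt", 2.0), (2, "Nut", 0.5)])
conn.executemany("INSERT INTO sales_order VALUES (?, ?, ?)", [
    (100, 1, "1999-03-01"),
    (101, 2, "1999-03-01"),
    (102, 1, "1999-03-02"),
])
conn.executemany("INSERT INTO order_line VALUES (?, ?, ?)",
                 [(100, 1, 10), (101, 1, 5), (102, 2, 4)])


def revenue_by_client(connection):
    """Revenue per client: three joins are needed to reach every attribute."""
    return connection.execute("""
        SELECT c.name, SUM(l.quantity * p.unit_price)
        FROM order_line AS l
        JOIN sales_order AS o ON o.order_id = l.order_id
        JOIN client AS c ON c.client_id = o.client_id
        JOIN product AS p ON p.product_id = l.product_id
        GROUP BY c.name
        ORDER BY c.name
    """).fetchall()


print(revenue_by_client(conn))
```

Adding a new attribute here usually means adding a column or table without touching existing rows, which is the flexibility the normalized approach trades against longer join paths.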
A data warehouse collects all of the data into one system, organizes the data so that it is consistent and easy to read, keeps old data for historical analysis, and makes access to, and use of data easy so that users can do it themselves (Corey et al. 1998).
Theodoratos et al. (1999) have argued that data warehouses should be maintained separately from an organization’s transaction processing databases to reduce the impact that queries have on operational systems and to safeguard operational data from being changed or lost. This separation also allows database administrators to merge fields from different systems to create new, subject-oriented data that end users can access directly using powerful graphical query and reporting tools (Inmon, 1992). Another reason for separating the data warehouse from an organization’s OLTP databases is that the data warehouse supports online analytical processing (OLAP), which enables users to manipulate the information stored in databases for complex decision support analysis.
Since the data warehouse is intended specifically for decision support queries, only data that is needed for decision support should be extracted from the operational databases and stored in the data warehouse (Poe et al., 1998). Decision-makers pose a variety of questions unlike those geared to transaction processing. Such queries arise from the need to analyze and process data in order to reach conclusions. These questions are usually complex and typically cover aspects not relevant to OLTP systems.
Getting accurate data from business operations is important to corporate success. Extract, transform, and load (ETL) tools make it possible for data warehouses to hold clean, consistent data taken from various systems and locations. Issues such as complexity and scalability have to be considered in order to create a clean, reliable, timely source of data on which to base mission-critical business decisions. One way to achieve a scalable, non-complex solution is to adopt a hub-and-spoke architecture for the transformation process (Ahmad et al., 1999).
Hub and spoke architecture
Hub-and-spoke architecture is a traditional enterprise application integration (EAI) architecture in which the hub handles the information exchange and transformation for many applications or data stores, the spokes (Theodoratos et al., 1999). Most application integration uses hub and spoke today; however, more advanced architectures based on distributed hubs are emerging. With the traditional hub-and-spoke arrangement, the number of interfaces between warehouse and marts is reduced compared with the large number of interfaces in a point-to-point design (Poe et al., 1998).
The hub-and-spoke method is the preferred design for businesses that aim at complete centralization, and it is described as the most efficient architecture for meeting the need to move data throughout the modern business enterprise quickly, efficiently, and accurately (Theodoratos and Sellis, 1999). Hub-and-spoke architecture essentially forms a middleware tier into which applications are integrated, instead of being integrated with each other. According to Anahory and Murray (1997), “the manageability of the integration task is significantly simplified with a data hub because it is capable of scaling more efficiently than a point-to-point solution. This architecture solves such e-business related problems as scalability of the enterprise architecture, and reduces coding requirements and development timeframes”.
The hub automates and manages the movement of data from disparate sources, such as midrange systems, mainframes, Windows NT-based servers, or even proprietary file systems, and ensures its safe, consistent arrival at the data warehouse (Ahmad et al., 1999). The engine serves to turn raw source data into vital information for decision-makers. Such systems save the time of highly paid developers. Additionally, the hub-and-spoke architecture allows for greater flexibility, scalability, and reliability, as hubs can be mirrored to provide fault tolerance and system availability. The solution grows as organizational requirements increase (Theodoratos et al., 1999).
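The scaling argument can be made concrete with a short sketch. It assumes that a point-to-point design needs one interface per pair of systems, while a hub needs one interface per system; the Hub class, its method names, and the sample records are hypothetical and only illustrate the idea of routing every source through a single transformation point.

```python
from typing import Callable, Dict, List


def point_to_point_interfaces(n_systems: int) -> int:
    """One interface per pair of systems (assumed undirected links)."""
    return n_systems * (n_systems - 1) // 2


def hub_and_spoke_interfaces(n_systems: int) -> int:
    """One interface per system, each connecting only to the hub."""
    return n_systems


class Hub:
    """Hypothetical hub: each spoke registers one transform into a common format."""

    def __init__(self) -> None:
        self._transforms: Dict[str, Callable[[dict], dict]] = {}
        self.warehouse: List[dict] = []

    def register_spoke(self, name: str, transform: Callable[[dict], dict]) -> None:
        self._transforms[name] = transform

    def publish(self, source: str, record: dict) -> None:
        # Apply the source-specific transform, then load the common format.
        cleaned = self._transforms[source](record)
        self.warehouse.append({"source": source, **cleaned})


hub = Hub()
hub.register_spoke("mainframe", lambda r: {"customer": r["CUST"].strip().title()})
hub.register_spoke("nt_server", lambda r: {"customer": r["customer_name"]})
hub.publish("mainframe", {"CUST": "  ACME LTD "})
hub.publish("nt_server", {"customer_name": "Birch Inc"})

print(point_to_point_interfaces(10), hub_and_spoke_interfaces(10))
print(hub.warehouse)
```

Under these assumptions ten systems need 45 pairwise interfaces but only 10 hub connections, and adding an eleventh system means writing one new transform rather than ten.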
Data Warehouse Components
The following overview describes the components of a data warehouse, based upon the work of W. H. Inmon.
Current Detail
The heart of a data warehouse is its current detail, where the bulk of the data resides. Current detail comes straight from operational systems and can be stored as raw data or as an aggregation of raw data (Ahmad et al., 1999). Organized by subject area, current detail represents the entire enterprise rather than a given application. It is the lowest level of data granularity in the data warehouse. Every data entity in current detail is a snapshot, at a moment in time, representing the instant at which the data are accurate. Current detail is typically two to five years old, and it is refreshed as frequently as necessary to support enterprise requirements.
System of Record
The system of record is the source of the best data that supply the data warehouse. The best data are those that are most timely, complete, and accurate and that conform best to the data warehouse in structure. Frequently the best data are those closest to the point of entry into the production environment. In other cases, a system of record may be one containing already summarized data (Lukauskis, 1999).
Integration and Transformation Programs
Even the best operational data cannot usually be copied, as is, into a data warehouse (Kimball, 2002, p. 33). Raw operational data are virtually incomprehensible to most end users. In addition, operational data seldom conform to the logical, subject-oriented structure of a data warehouse. Moreover, different operational systems represent data differently, use different codes for the same thing, squeeze multiple pieces of information into one field, and more. Operational data can also come from many different physical sources: old mainframe files, non-relational databases, indexed flat files, and even proprietary tape and card-based systems. Thus operational data must be cleaned up, edited, and reformatted before being loaded into a data warehouse (Kimball, 2002).
As operational data items pass from their systems of record to a data warehouse, integration and transformation programs convert them from application-specific data into enterprise data. These integration and transformation programs perform functions such as:
- Reformatting, recalculating, or modifying key structures
- Adding time elements
- Identifying default values
- Supplying logic to choose between multiple data sources
- Summarizing, tallying and merging data from multiple sources
When either the operational or the data warehouse environment changes, integration and transformation programs are modified to reflect that change (Lukauskis, 1999).
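The functions listed above can be sketched as a single transformation step. The field names (cust_no, status, amount), the code mapping, the key format, and the rule of preferring one source over another are all assumptions made for illustration; a real integration program would take them from the systems of record involved.

```python
from datetime import date

# Hypothetical mapping that reconciles different source codes for the same thing.
STATUS_CODES = {"A": "ACTIVE", "1": "ACTIVE", "I": "INACTIVE", "0": "INACTIVE"}


def to_enterprise_record(billing: dict, crm: dict, orders: list, snapshot: date) -> dict:
    """Convert application-specific records into one enterprise record."""
    return {
        # Reformat the key structure: numeric source key -> fixed-width key.
        "customer_key": f"CUST-{int(billing['cust_no']):06d}",
        # Choose between multiple sources: prefer CRM, fall back to billing.
        "name": crm.get("name") or billing.get("name"),
        # Identify a default value when the code is missing or unknown.
        "status": STATUS_CODES.get(str(billing.get("status", "")).upper(), "UNKNOWN"),
        # Summarize data merged from another source.
        "order_total": sum(order["amount"] for order in orders),
        # Add a time element: the date on which this snapshot was taken.
        "snapshot_date": snapshot.isoformat(),
    }


record = to_enterprise_record(
    billing={"cust_no": "0042", "name": "ACME LTD", "status": "a"},
    crm={"name": "Acme Ltd"},
    orders=[{"amount": 20}, {"amount": 5}],
    snapshot=date(1999, 3, 1),
)
print(record)
```

Keeping each rule on its own line makes it easier to adjust the program when a source system or the warehouse structure changes.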
Summarized Data
Not all enterprise departments have the same information requirements, so a successful data warehouse design provides customized, lightly summarized data for every enterprise unit. An enterprise department may have access to both detailed and summarized data, but the volume will be much less than the total stored in current detail (Theodoratos and Sellis, 1999).
Highly summarized data are intended primarily for company executives. They can come from either the lightly summarized data used by enterprise units or from current detail. Data volume at this level is much less than at other levels and represents an eclectic collection supporting a wide variety of needs and interests. In addition to access to highly summarized data, executives commonly also have the ability to access increasing levels of detail through a drill-down process.
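The relationship between current detail, the two summary levels, and drill-down can be sketched as follows. The sample rows, the choice of region and month as the lightly summarized grain, and the choice of year as the highly summarized grain are assumptions made only for illustration.

```python
from collections import defaultdict

# Hypothetical current-detail rows; a real warehouse would hold far more.
current_detail = [
    {"date": "1998-11-03", "region": "East", "amount": 100},
    {"date": "1998-11-20", "region": "East", "amount": 50},
    {"date": "1999-01-15", "region": "East", "amount": 70},
    {"date": "1999-01-18", "region": "West", "amount": 30},
]


def summarize(rows, key_fn):
    """Total the amount of each row under the key returned by key_fn."""
    totals = defaultdict(int)
    for row in rows:
        totals[key_fn(row)] += row["amount"]
    return dict(totals)


def region_month(row):
    """Lightly summarized grain: one total per region and month."""
    return (row["region"], row["date"][:7])


def year(row):
    """Highly summarized grain: one total per year."""
    return row["date"][:4]


lightly_summarized = summarize(current_detail, region_month)
highly_summarized = summarize(current_detail, year)


def drill_down(rows, selected_year):
    """Expand one yearly total into its region and month components."""
    detail = [row for row in rows if year(row) == selected_year]
    return summarize(detail, region_month)


print(highly_summarized)
print(drill_down(current_detail, "1999"))
```

Each level holds fewer rows than the one beneath it, and drilling down simply recomputes a finer summary for the slice the user selected.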
Archives
Data warehouse archives contain old data (normally over two years old) of significant, continuing interest and value to the company (Anahory and Murray, 1997). There is typically a massive amount of data stored in the data warehouse archives that has a low incidence of access. Archive data are most often used for forecasting and trend analysis. Although archive data may be stored at the same level of granularity as current detail, it is more likely that archive data are aggregated as they are archived. Archives include not only old data; they also include the metadata that describes the old data’s characteristics (Inmon, 1992).
Metadata
Metadata is one of the most important components of a data warehouse. Also called data warehouse architecture, metadata is integral to all levels of the data warehouse, although it exists and functions in a different dimension from other warehouse data (Poe et al., 1998). Metadata that is used to manage and control data warehouse creation and maintenance resides outside the data warehouse. A complete data warehouse architecture includes data and technical elements. Theodoratos breaks the architecture down into three main areas:
- Data architecture, which is centered on business processes.
- Infrastructure, which includes hardware, networking, operating systems, and desktop machines.
- The technical area, which includes the decision-making technologies that will be required by the users, as well as their supporting structures (Theodoratos et al., 1999).
Data Architecture
The data architecture portion of the overall data warehouse architecture is driven by business processes. In a manufacturing setting, for example, the data model may include orders, shipping, and billing. Each area draws on a different set of dimensions (Anahory and Murray, 1997). Where dimensions intersect in the data model, their definitions have to be the same: the customer who buys is the same customer who is billed. Data items should therefore have a common structure and content, and involve a single process to create and maintain them.
Infrastructure Architecture
With the required hardware platform and boxes, the data warehouse sometimes becomes its own IS shop. Indeed, there are many boxes in data warehousing, mostly used for databases and application servers. The main issues with hardware and DBMS choices are size, scalability, and flexibility (Anahory and Murray, 1997).
Technical Architecture
The technical architecture is driven by the metadata catalog. Everything should be metadata-driven. “The services should draw the needed parameters from tables, rather than hard-coding them,” says Theodoratos. An important component of the technical architecture is the data staging process.
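What drawing parameters from tables might look like in a staging service can be sketched as follows. The catalog layout, the field names (CUSTNO, AMT, REGION), and the stage_record function are hypothetical; the only point illustrated is that the mapping rules live in data rather than in the code.

```python
# Hypothetical metadata catalog: one row per target field. In practice these
# rows would be read from a metadata table rather than defined in the program.
METADATA_CATALOG = [
    {"source_field": "CUSTNO", "target_field": "customer_id", "type": int, "default": None},
    {"source_field": "AMT", "target_field": "amount", "type": float, "default": 0.0},
    {"source_field": "REGION", "target_field": "region", "type": str, "default": "UNKNOWN"},
]


def stage_record(raw: dict, catalog: list) -> dict:
    """Rename, convert, and default fields using only the catalog's rules."""
    staged = {}
    for rule in catalog:
        value = raw.get(rule["source_field"])
        if value is None or value == "":
            staged[rule["target_field"]] = rule["default"]
        else:
            staged[rule["target_field"]] = rule["type"](value)
    return staged


print(stage_record({"CUSTNO": "17", "AMT": "12.5"}, METADATA_CATALOG))
```

Supporting a new source field under this design means adding a catalog row, with no change to the staging code itself.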
Advantages of data warehouse
The data warehouse approach presents some advantages over the traditional on-demand approach (Theodoratos and Sellis, 1999):
- High query performance can be attained for complex aggregation queries that are needed for in-depth analysis, decision support, and data mining because queries can be answered locally without accessing the original information sources.
- Because online analytical processing (OLAP) is separated from online transaction processing (OLTP), information is made accessible to decision-makers without OLAP interfering with local processing at the operational sources.
- Data warehouses enhance end-user access to a wide variety of data.
- Decision support system users can obtain specific trend reports, e.g. the item with the most sales in a particular area within the last two years.
- Data warehouses can be a significant enabler of commercial business applications, particularly customer relationship management (CRM) systems.
- A data warehouse reduces the staff and computer resources needed to support queries and reports against operational and production databases, which usually offers considerable savings. It also eliminates the resource drain on production systems caused by long-running, complex queries and reports.
- Better enterprise intelligence. Improved quality and flexibility of enterprise analysis arise from the multi-tiered data structures of a data warehouse, which support data ranging from the detailed transactional level to high-level summaries. Data accuracy and reliability result from ensuring that a data warehouse contains only trusted data.
- Superior customer service. An organization can maintain better customer relationships by correlating all customer data through a single data warehouse.
- Re-engineering of business processes. Unrestricted analysis of enterprise information often provides insights into enterprise processes that may yield breakthrough ideas for re-engineering those processes. Defining the requirements for a data warehouse also results in better enterprise goals and measures.
- Re-engineering of IS. A data warehouse that is based upon enterprise-wide data requirements provides a cost-effective means of establishing both data standardization and operational system interoperability.
Limitations
- Extracting, transforming, and loading data consumes a lot of time and computational resources.
- Data warehousing project scope must be actively managed to deliver a release of defined content and value.
- Compatibility problems with systems already in place.
- Security could develop into a serious issue, especially if the data warehouse is web-accessible.
- The controversy over data storage design deserves careful consideration, and perhaps prototyping of the data warehouse solution, for each project’s environment (Ahmad et al., 1999).
Conclusion
Data warehouses were developed to meet the growing need for management information and analysis that could not be fulfilled by operational systems. As a result, it became possible to bring in data from various data sources and integrate this information in one place. With advances in technology and increasing user requirements, data warehouses have evolved through several basic phases: offline operational databases, offline data warehouses, and real-time data warehouses.
Data warehousing supports online analytical processing for long-term decision-making. With successful data warehouse implementation, managers will be less dependent on IT professionals and will use data more effectively. A complete data warehouse architecture includes data and technical elements: data architecture, which is centered on business processes; infrastructure, which includes hardware, networking, operating systems, and desktop machines; and the technical area, which includes the decision-making technologies that will be required by the users, as well as their supporting structures.
The contemporary enterprise needs ETL solutions that are well managed and easy to implement. Such solutions are best served by the flexibility and efficiency of the hub-and-spoke architecture, which provides a centralized and open repository that companies can leverage in order to maintain full control of all data exchange processes, business rules, and metadata. By centralizing business information management, the hub-and-spoke architecture empowers knowledge workers to make better, more efficient use of business intelligence and analytic applications.
References
Adriaans, P. and Zantinge, D. 1996, Data Mining, Addison Wesley Longman Limited, Reading, Massachusetts.
Ahmad, I. et al. 1999, “Data warehousing in the construction industry: Organizing and processing data for decision-making”, Vancouver, British Columbia.
Anahory, S. and Murray, D. 1997, Data Warehousing in the Real World: A Practical Guide for Building Decision Support Systems, Addison Wesley, Massachusetts.
Corey, M. et al. 1998, Oracle 8 Data Warehousing: A Practical Guide to Successful Data Warehouse Analysis, Oracle Press.
Inmon, W.H. 1992, Building the Data Warehouse, John Wiley.
Inmon, W.H. 1992b, Rdb/VMS: Developing the Data Warehouse, John Wiley.
Kimball, R. and Ross, M. 2002, The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition), Wiley.
Lukauskis, P. 1999, A Decision Support System Using Data Warehousing and GIS to Assist Builders/Developers in Site Selection, M.S. Thesis, Florida International University, Miami, Florida, USA.
Poe, V. et al. 1998, Building a Data Warehouse for Decision Support, Prentice Hall, Upper Saddle River, New Jersey.
Theodoratos et al. 1999, “Design data warehouse”, Data and Knowledge Engineering, Vol. 31, pp. 279-301.