Monday, 16 July 2018

In computing, a data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis, and is considered a core component of business intelligence. DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in a single place that is used to create analytical reports for workers throughout the enterprise.

The data stored in the warehouse is uploaded from operational systems (such as marketing or sales). The data may pass through an operational data store and may require data cleansing to ensure data quality before it is used in the DW for reporting.

A typical Extract, transform, load (ETL)-based data warehouse uses staging, data integration, and access layers to house its key functions. The staging layer, or staging database, stores raw data extracted from each of the disparate source data systems. The integration layer integrates the disparate data sets by transforming the data from the staging layer, often storing this transformed data in an operational data store (ODS) database. The integrated data is then moved to yet another database, often called the data warehouse database, where the data is arranged into hierarchical groups, often called dimensions, and into facts and aggregate facts. The combination of facts and dimensions is sometimes called a star schema. The access layer helps users retrieve data.
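The staging, integration, and warehouse layers described above can be sketched in a few lines of Python. This is a minimal illustration, not a production ETL tool; the source systems, field names, and cleansing rule are invented for the example.

```python
# Minimal sketch of the staging -> integration -> warehouse flow.
# Source records, field names, and cleansing rules are invented.

def extract(sources):
    """Staging layer: copy raw rows from each source system unchanged."""
    staging = []
    for system, rows in sources.items():
        for row in rows:
            staging.append({"source": system, **row})
    return staging

def transform(staging):
    """Integration layer: reconcile inconsistent codes across sources."""
    country_codes = {"USA": "US", "U.S.": "US", "US": "US"}
    integrated = []
    for row in staging:
        clean = dict(row)
        clean["country"] = country_codes.get(row["country"], row["country"])
        integrated.append(clean)
    return integrated

def load(integrated):
    """Warehouse layer: aggregate integrated rows into one fact per country."""
    facts = {}
    for row in integrated:
        facts[row["country"]] = facts.get(row["country"], 0) + row["amount"]
    return facts

sources = {
    "sales":     [{"country": "USA",  "amount": 100}],
    "marketing": [{"country": "U.S.", "amount": 50}],
}
warehouse = load(transform(extract(sources)))
print(warehouse)  # {'US': 150}
```

Note how the inconsistent country codes from the two source systems are reconciled in the integration step before the fact is aggregated, which is the essence of the integration layer.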

The main sources of the data are cleansed, transformed, cataloged, and made available for use by managers and other business professionals for data mining, online analytical processing, market research, and decision support. However, the means to retrieve and analyze data, to extract, transform, and load data, and to manage the data dictionary are also considered essential components of a data warehousing system. Many references to data warehousing use this broader context. Thus, an expanded definition of data warehousing includes business intelligence tools, tools to extract, transform, and load data into the repository, and tools to manage and retrieve metadata.


Benefits

A data warehouse maintains a copy of information from the source transaction systems. This architectural complexity provides the opportunity to:

  • Integrate data from multiple sources into a single database and data model. Greater congregation of data into a single database means a single query engine can be used to present data in an ODS.
  • Mitigate the problem of database isolation level lock contention in transaction processing systems caused by attempts to run large, long-running analysis queries in transaction processing databases.
  • Maintain data history, even if the source transaction systems do not.
  • Integrate data from multiple source systems, enabling a central view across the enterprise. This benefit is always valuable, but particularly so when the organization has grown by merger.
  • Improve data quality, by providing consistent codes and descriptions, flagging or even fixing bad data.
  • Present the organization's information consistently.
  • Provide a single common data model for all data of interest regardless of the data's source.
  • Restructure the data so that it makes sense to the business users.
  • Restructure the data so that it delivers excellent query performance, even for complex analytic queries, without impacting the operational systems.
  • Add value to operational business applications, notably customer relationship management (CRM) systems.
  • Make decision-support queries easier to write.
  • Organize and disambiguate repetitive data.

General environment

The environment for data warehouses and marts includes the following:

  • Source systems that provide data to the warehouse or mart;
  • Data integration technology and processes that are needed to prepare the data for use;
  • Different architectures for storing data in an organization's data warehouse or data marts;
  • Different tools and applications for the variety of users;
  • Metadata, data quality, and governance processes must be in place to ensure that the warehouse or mart meets its purposes.

In regards to source systems listed above, R. Kelly Rainer states, "A common source for the data in data warehouses is the company's operational databases, which can be relational databases".

Regarding data integration, Rainer states, "It is necessary to extract data from source systems, transform them, and load them into a data mart or warehouse".

Rainer discusses storing data in an organization's data warehouse or data marts.

Metadata is data about data. "IT personnel need information about data sources; database, table, and column names; refresh schedules; and data usage measures".

Today, the most successful companies are those that can respond quickly and flexibly to market changes and opportunities. A key to this response is the effective and efficient use of data and information by analysts and managers. A "data warehouse" is a repository of historical data that is organized by subject to support decision makers in the organization. Once data is stored in a data mart or warehouse, it can be accessed.

Related systems (mart data, OLAP, OLTP, predictive analysis)

A data mart is a simple form of data warehouse that is focused on a single subject (or functional area); hence it draws data from a limited number of sources such as sales, finance, or marketing. Data marts are often built and controlled by a single department within an organization. The sources could be internal operational systems, a central data warehouse, or external data. Denormalization is the norm for data modeling techniques in this system. Given that data marts generally cover only a subset of the data contained in a data warehouse, they are often easier and faster to implement.

Types of data marts include dependent, independent, and hybrid data marts.

Online analytical processing (OLAP) is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is a measure of effectiveness. OLAP applications are widely used by data mining techniques. OLAP databases store aggregated, historical data in multi-dimensional schemas (usually star schemas). OLAP systems typically have data latency of a few hours, as opposed to data marts, where latency is expected to be closer to one day. The OLAP approach is used to analyze multidimensional data from multiple sources and perspectives. The three basic operations in OLAP are: roll-up (consolidation), drill-down, and slicing & dicing.
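The roll-up and slice operations can be illustrated on a toy cube. This is only a sketch; the (region, city, quarter) dimensions and the sales figures are invented, and real OLAP engines operate on far larger multidimensional structures.

```python
# Toy illustration of OLAP operations on an invented sales cube.
# Each cell is keyed by (region, city, quarter); the figures are made up.

cube = {
    ("West", "Seattle",  "Q1"): 10,
    ("West", "Portland", "Q1"): 7,
    ("East", "Boston",   "Q1"): 12,
    ("West", "Seattle",  "Q2"): 9,
}

def roll_up(cube):
    """Roll-up (consolidation): aggregate city-level cells to region level."""
    out = {}
    for (region, _city, quarter), value in cube.items():
        out[(region, quarter)] = out.get((region, quarter), 0) + value
    return out

def slice_quarter(cube, quarter):
    """Slice: fix one dimension (quarter) to obtain a sub-cube."""
    return {k: v for k, v in cube.items() if k[2] == quarter}

regional = roll_up(cube)
print(regional[("West", "Q1")])  # 17
q1 = slice_quarter(cube, "Q1")
print(len(q1))                   # 3
```

Drill-down is simply the reverse of roll-up: navigating from the regional summary back to the underlying city-level cells.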

Online transaction processing (OLTP) is characterized by a large number of short online transactions (INSERT, UPDATE, DELETE). OLTP systems emphasize very fast query processing and maintaining data integrity in multi-access environments. For OLTP systems, effectiveness is measured by the number of transactions per second. OLTP databases contain detailed and current data. The schema used to store transactional databases is the entity model (usually 3NF). Normalization is the norm for data modeling techniques in this system.

Predictive analytics is about finding and quantifying hidden patterns in data using complex mathematical models that can be used to predict future outcomes. Predictive analytics differs from OLAP in that OLAP focuses on historical data analysis and is reactive in nature, while predictive analytics focuses on the future. These systems are also used for customer relationship management (CRM).

History

The concept of data warehousing dates back to the late 1980s when IBM researchers Barry Devlin and Paul Murphy developed the "business data warehouse". In essence, the data warehousing concept was intended to provide an architectural model for the flow of data from operational systems to decision support environments. The concept attempted to address the various problems associated with this flow, mainly the high costs associated with it. In the absence of a data warehousing architecture, an enormous amount of redundancy was required to support multiple decision support environments. In larger corporations, it was typical for multiple decision support environments to operate independently. Though each environment served different users, they often required much of the same stored data. The process of gathering, cleaning and integrating data from various sources, usually from long-term existing operational systems (usually referred to as legacy systems), was typically in part replicated for each environment. Moreover, the operational systems were frequently reexamined as new decision support requirements emerged. Often new requirements necessitated gathering, cleaning and integrating new data from "data marts" that were tailored for ready access by users.

The major developments in the early years of data warehousing were:

  • 1960s - General Mills and Dartmouth College, in a joint research project, develop the terms dimensions and facts.
  • 1970s - ACNielsen and IRI provide dimensional data marts for retail sales.
  • 1970s - Bill Inmon begins to define and discuss the term Data Warehouse.
  • 1975 - Sperry Univac introduces MAPPER (MAintain, Prepare, and Produce Executive Reports), a database management and reporting system that includes the world's first 4GL. It is the first platform designed for building Information Centers (a forerunner of contemporary data warehouse technology).
  • 1983 - Teradata introduces the DBC/1012 database computer specifically designed for decision support.
  • 1984 - Metaphor Computer Systems, founded by David Liddle and Don Massaro, releases a hardware/software package and GUI for business users to create a database management and analytic system.
  • 1985 - Sperry Corporation publishes an article (Martyn Jones and Philip Newman) on information centers, where they introduce the term MAPPER data warehouse in the context of information centers.
  • 1988 - Barry Devlin and Paul Murphy publish an article on an architecture for business and information systems where they introduce the term "business data warehouse".
  • 1990 - Red Brick Systems, founded by Ralph Kimball, introduces Red Brick Warehouse, a database management system specifically for data warehousing.
  • 1991 - Prism Solutions, founded by Bill Inmon, introduces Prism Warehouse Manager, software for developing a data warehouse.
  • 1992 - Bill Inmon publishes the book Building the Data Warehouse.
  • 1995 - The Data Warehousing Institute, an organization that promotes data warehousing, is founded.
  • 1996 - Ralph Kimball publishes the book The Data Warehouse Toolkit.
  • 2012 - Bill Inmon develops and makes public technology known as "textual disambiguation". Textual disambiguation applies context to raw text and reformats the raw text and context into a standard database format. Once raw text is passed through textual disambiguation, it can easily and efficiently be accessed and analyzed by standard business intelligence technology. Textual disambiguation is accomplished through the execution of textual ETL. Textual disambiguation is useful wherever raw text is found, such as in documents, Hadoop, email, and so forth.

Information storage

Fact

A fact is a value, or measurement, which represents a fact about the managed entity or system.

Facts, as reported by the reporting entity, are said to be at the raw level. For example, in a mobile telephone system, if a base transceiver station (BTS) receives 1,000 requests for traffic channel allocation, allocates for 820, and rejects the rest, it would report three facts or measurements to a management system:

  • tch_req_total = 1000
  • tch_req_success = 820
  • tch_req_fail = 180

Facts at the raw level are further aggregated to higher levels in various dimensions to extract information more relevant to the service or business. These are called aggregates, summaries, or aggregated facts.

For instance, if there are three BTSs in a city, then the facts above can be aggregated from the BTS level to the city level in the network dimension. For example:

  • tch_req_success_city = tch_req_success_bts1 + tch_req_success_bts2 + tch_req_success_bts3
  • avg_tch_req_success_city = (tch_req_success_bts1 + tch_req_success_bts2 + tch_req_success_bts3) / 3
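The aggregation above can be sketched directly in Python. The per-BTS success counts are invented for illustration (here they are chosen to sum to the 820 successes of the earlier example).

```python
# Sketch of rolling raw BTS-level facts up to the city level along the
# network dimension. The per-BTS figures are invented.

tch_req_success = {"bts1": 320, "bts2": 280, "bts3": 220}

tch_req_success_city = sum(tch_req_success.values())
avg_tch_req_success_city = tch_req_success_city / len(tch_req_success)

print(tch_req_success_city)                # 820
print(round(avg_tch_req_success_city, 2))  # 273.33
```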

Dimensional approach versus normalization for data storage

There are two main approaches for storing data in a data warehouse - the dimensional approach and the normalized approach.

Dimension approach

The dimensional approach refers to Ralph Kimball's approach, in which it is stated that the data warehouse should be modeled using a dimensional model/star schema. The normalized approach, also called the 3NF model (Third Normal Form), refers to Bill Inmon's approach, in which it is stated that the data warehouse should be modeled using an E-R model/normalized model.

In the dimensional approach, transaction data is partitioned into "facts", which are generally numeric transaction data, and "dimensions", which are the reference information that gives context to the facts. For example, a sales transaction can be broken up into facts such as the number of products ordered and the total price paid for the products, and into dimensions such as order date, customer name, product number, order ship-to and bill-to locations, and the salesperson responsible for receiving the order.
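A minimal star schema for this sales example can be sketched as one fact table referencing dimension tables. All table contents and keys here are invented for illustration; in practice these would be relational tables, not Python dictionaries.

```python
# Illustrative star schema: one fact table plus two dimension tables.
# All rows and surrogate keys are invented.

dim_customer = {1: {"name": "Acme Corp"}}
dim_product  = {10: {"product_number": "SKU-10"}}

fact_sales = [
    # (customer_key, product_key, quantity_ordered, total_price)
    (1, 10, 3, 29.97),
    (1, 10, 2, 19.98),
]

# A typical star-schema query: total revenue per customer name,
# joining the fact table to the customer dimension.
revenue = {}
for cust_key, _prod_key, _qty, price in fact_sales:
    name = dim_customer[cust_key]["name"]
    revenue[name] = round(revenue.get(name, 0) + price, 2)

print(revenue)  # {'Acme Corp': 49.95}
```

The query needs only a single join from the central fact table to each dimension, which is why the star layout tends to be fast and easy for business users to follow.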

The main advantage of the dimensional approach is that the data warehouse is easier for the user to understand and to use. Also, retrieval of data from the data warehouse tends to operate very quickly. Dimensional structures are easy for business users to understand, because the structure is divided into measurements/facts and context/dimensions. Facts are related to the organization's business processes and operational system, whereas the dimensions surrounding them contain context about the measurement (Kimball, Ralph 2008). Another advantage of the dimensional model is that it does not always involve a relational database. Thus, this type of modeling technique is very useful for end-user queries in the data warehouse.

The model of facts and dimensions can also be understood as a data cube, where dimensions are the categorical coordinates in a multi-dimensional cube, while facts are the values corresponding to the coordinates.

The main disadvantages of the dimensional approach are as follows:

  1. To preserve the integrity of facts and dimensions, loading data warehouses with data from different operational systems is complex.
  2. It is difficult to change the data warehouse structure if an organization adopting a dimensional approach changes the way it does business.

The normalization approach

In the normalized approach, the data in the data warehouse is stored following, to a degree, database normalization rules. Tables are grouped together by subject areas that reflect general data categories (e.g., data on customers, products, finance, etc.). The normalized structure divides data into entities, which creates several tables in a relational database. When applied in large enterprises the result is dozens of tables that are linked together by a web of joins. Furthermore, each of the created entities is converted into a separate physical table when the database is implemented (Kimball, Ralph 2008). The main advantage of this approach is that it is straightforward to add information into the database. Some disadvantages of this approach are that, because of the number of tables involved, it can be difficult for users to join data from different sources into meaningful information and to access the information without a precise understanding of the sources of data and of the data structure of the data warehouse.
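The usability cost of normalization can be seen by answering the same kind of sales question against normalized entities. The tables, keys, and figures are invented; the point is only that a simple question now chains through several joins.

```python
# Sketch of the normalized approach for sales data: entities are split
# into separate tables and a simple question requires chaining joins.
# Table names, keys, and rows are invented.

customers   = {1: {"name": "Acme Corp", "city_id": 100}}
cities      = {100: {"city": "Seattle"}}
orders      = [{"order_id": 7, "customer_id": 1}]
order_lines = [{"order_id": 7, "amount": 29.97}]

# "Revenue by city" needs three joins:
# order_lines -> orders -> customers -> cities.
revenue_by_city = {}
for line in order_lines:
    order = next(o for o in orders if o["order_id"] == line["order_id"])
    customer = customers[order["customer_id"]]
    city = cities[customer["city_id"]]["city"]
    revenue_by_city[city] = revenue_by_city.get(city, 0) + line["amount"]

print(revenue_by_city)  # {'Seattle': 29.97}
```

Compared with the star-schema layout, nothing is duplicated here, but every query must understand the full web of relationships to produce meaningful results.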

Both normalized and dimensional models can be represented in entity-relationship diagrams, as both contain joined relational tables. The difference between the two models is the degree of normalization (also known as Normal Forms). These approaches are not mutually exclusive, and there are other approaches. Dimensional approaches can involve normalizing data to a degree (Kimball, Ralph 2008).

In Information-Driven Business, Robert Hillard proposes an approach to comparing the two approaches based on the information needs of the business problem. The technique shows that normalized models hold far more information than their dimensional equivalents (even when the same fields are used in both models), but this extra information comes at the cost of usability. The technique measures information quantity in terms of information entropy and usability in terms of the Small Worlds data transformation measure.

Design method

Bottom-up design

In the bottom-up approach, data marts are first created to provide reporting and analytical capabilities for specific business processes. These data marts can then be integrated to create a comprehensive data warehouse. The data warehouse bus architecture is primarily an implementation of "the bus", a collection of conformed dimensions and conformed facts, which are dimensions that are shared (in a specific way) between facts in two or more data marts.

Top-down design

The top-down approach is designed using a normalized enterprise data model. "Atomic" data, that is, data at the greatest level of detail, is stored in the data warehouse. Dimensional data marts containing data needed for specific business processes or specific departments are created from the data warehouse.

Hybrid design

The data warehouse (DW) often resembles a hub-and-spokes architecture. Legacy systems feeding the warehouse often include customer relationship management and enterprise resource planning systems, which generate large amounts of data. To consolidate these various data models, and facilitate the extract-transform-load process, data warehouses often make use of an operational data store, the information from which is parsed into the actual DW. To reduce data redundancy, larger systems often store the data in a normalized way. Data marts for specific reports can then be built on top of the data warehouse.

The hybrid DW database is kept in third normal form to eliminate data redundancy. A normal relational database, however, is not efficient for business intelligence reports where dimensional modeling is prevalent. Small data marts can shop for data from the consolidated warehouse and use the filtered, specific data for the fact and dimension tables required. The DW provides a single source of information from which the data marts can read, providing a wide range of business information. The hybrid architecture allows a DW to be replaced with a master data management repository where operational, not static, information could reside.

The data vault modeling components follow hub-and-spokes architecture. This modeling style is a hybrid design, consisting of the best practices from both third normal form and star schema. The data vault model is not a true third normal form, and breaks some of its rules, but it is a top-down architecture with a bottom-up design. The data vault model is geared to be strictly a data warehouse. It is not geared to be end-user accessible, and when built, it still requires the use of a data mart or star schema-based release area for business purposes.

Characteristics of data warehouse

There are basic features that define the data in a data warehouse, which include subject orientation, data integration, time variance, nonvolatile data, and summarized data.

Subject-Oriented

Unlike the operational systems, the data in the data warehouse revolves around the subjects of the enterprise. Subject orientation can be really useful for decision making. Gathering the required objects is called subject-oriented.

Integrated

The data found in the data warehouse is integrated. Since it comes from several operational systems, all inconsistencies must be removed. Consistency covers naming conventions, measurement of variables, encoding structures, physical attributes of data, and so forth.

Time-variant

While operational systems reflect current values as they support day-to-day operations, data warehouse data represents data over a long time horizon (up to 10 years), which means it stores mostly historical data. It is mainly meant for data mining and forecasting. If a user is searching for the buying pattern of a specific customer, the user needs to look at data on the current and past purchases.

Nonvolatile

The data in the data warehouse is read-only, which means it cannot be updated, created, or deleted.

Summarized

In the data warehouse, data is summarized at different levels. The user may start by looking at the total sale units of a product in an entire region. Then the user looks at the states in that region. Finally, they may examine the individual stores in a certain state. Therefore, typically, the analysis starts at a higher level and drills down to lower levels of detail.
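The region-to-state-to-store drill-down described above can be sketched with one aggregation routine parameterized by level. The rows and unit counts are invented for illustration.

```python
# Sketch of drill-down: sales summarized at region, state, and store
# level. All figures are invented.

sales = [
    {"region": "West", "state": "WA", "store": "Store-1", "units": 120},
    {"region": "West", "state": "WA", "store": "Store-2", "units": 80},
    {"region": "West", "state": "OR", "store": "Store-3", "units": 50},
]

def summarize(rows, level):
    """Aggregate unit sales at the requested level of detail."""
    out = {}
    for row in rows:
        out[row[level]] = out.get(row[level], 0) + row["units"]
    return out

print(summarize(sales, "region"))  # {'West': 250}
print(summarize(sales, "state"))   # {'WA': 200, 'OR': 50}
print(summarize(sales, "store"))   # per-store detail
```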

Data warehouse architecture

The different methods used to construct and organize a data warehouse specified by an organization are numerous. The hardware utilized, the software created, and the data resources specifically required for the correct functionality of a data warehouse are the main components of the data warehouse architecture. All data warehouses have multiple phases in which the requirements of the organization are modified and fine-tuned.

Versus operational systems

Operational systems are optimized for the preservation of data integrity and speed of recording of business transactions through use of database normalization and an entity-relationship model. Operational system designers generally follow Codd's 12 rules of database normalization to ensure data integrity. Fully normalized database designs (that is, those satisfying all Codd rules) often result in information from a business transaction being stored in dozens to hundreds of tables. Relational databases are efficient at managing the relationships between these tables. The databases have very fast insert/update performance because only a small amount of data in those tables is affected each time a transaction is processed. To improve performance, older data is usually periodically purged from operational systems.

Data warehouses are optimized for analytic access patterns. Analytic access patterns generally involve selecting specific fields and rarely, if ever, 'select *', which is more common in operational databases. Because of these differences in access patterns, operational databases (loosely, OLTP) benefit from the use of a row-oriented DBMS, whereas analytics databases (loosely, OLAP) benefit from the use of a column-oriented DBMS. Unlike operational systems, which maintain a snapshot of the business, data warehouses generally maintain an infinite history, which is implemented through ETL processes that periodically migrate data from the operational systems over to the data warehouse.
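The row-oriented versus column-oriented distinction can be illustrated with two layouts of the same invented table. This is only a sketch of the storage idea; real DBMS engines add compression, vectorized scans, and on-disk formats that this example omits.

```python
# Sketch contrasting row- and column-oriented layouts for the access
# patterns described above. The table and values are invented.

# Row-oriented: each record is stored together (good for OLTP, where a
# transaction inserts or updates one whole record at a time).
rows = [
    {"id": 1, "amount": 10.0, "region": "West"},
    {"id": 2, "amount": 20.0, "region": "East"},
]

# Column-oriented: each field is stored together (good for analytic
# scans that read one field across many records).
columns = {
    "id":     [1, 2],
    "amount": [10.0, 20.0],
    "region": ["West", "East"],
}

# An analytic query touching one field scans a single contiguous column...
total = sum(columns["amount"])
# ...while the row layout must visit every record to reach that field.
total_rows = sum(r["amount"] for r in rows)

print(total, total_rows)  # 30.0 30.0
```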

Evolution in organizational use

These terms refer to the level of sophistication of a data warehouse:

Offline operational data warehouse
Data warehouses in this stage of evolution are updated on a regular time cycle (usually daily, weekly or monthly) from the operational systems, and the data is stored in an integrated reporting-oriented database.
Offline data warehouse
Data warehouses at this stage are updated from data in the operational systems on a regular basis, and the data warehouse data is stored in a data structure designed to facilitate reporting.
On-time data warehouse
Online Integrated Data Warehousing represents the real-time data warehouse: data in the warehouse is updated for every transaction performed on the source data.
Integrated data warehouse
These data warehouses assemble data from different areas of business, so users can look up the information they need across other systems.

Further reading

  • Davenport, Thomas H. and Harris, Jeanne G. Competing on Analytics: The New Science of Winning (2007) Harvard Business School Press. ISBN 978-1-4221-0332-6
  • Ganczarski, Joe. Data Warehouse Implementations: Critical Implementation Factors Study (2009) VDM Verlag ISBN 3-639-18589-7, ISBN 978-3-639-18589-8
  • Kimball, Ralph and Ross, Margy. The Data Warehouse Toolkit Third Edition (2013) Wiley, ISBN 978-1-118-53080-1
  • Linstedt, Graziano, Hultgren. The Business of Data Vault Modeling Second Edition (2010) Dan Linstedt, ISBN 978-1-4357-1914-9
  • Inmon, William. Building the Data Warehouse (2005) John Wiley and Sons, ISBN 978-81-265-0645-3


Source of the article: Wikipedia
