What Is a Data Warehouse?
What
Is a Data Warehouse?
Data
warehouse is a relational database that is designed for query and analysis. Data
warehouse (DW or DWH), also known as an enterprise data
warehouse (EDW), is a system used for reporting and data analysis.
DWs are central repositories of integrated data from one or more disparate
sources. They store current and historical data and are used for creating
trending reports for senior management reporting such as annual and quarterly
comparisons.
The
data stored in the warehouse is uploaded from the operational systems (such as
marketing, sales, etc., shown in the figure to the right). The data may pass
through an operational data store for additional operations before it is used
in the DW for reporting.
This
helps in:
·
Maintaining
historical records
·
Analyzing the
data to gain a better understanding of the business and to improve the business
Operational
and Informational systems
Operational
Systems:
An operational
system is a term used in data warehousing to refer to a system that is used
to process the day-to-day transactions of an organization. These systems are
designed in a manner that processing of day-to-day transactions is performed
efficiently and the integrity of the transactional data is preserved.
Sometimes
operational systems are referred to as operational databases, transaction
processing systems, or online transaction processing systems (OLTP). However,
the use of the last two terms as synonyms may be confusing, because operational
systems can be batch processing systems as well.
Any
Enterprise must necessarily maintain a lot of data about its operation. This is
its "Operational Data".
Organization
|
Probably
|
Manufacturing
Company
|
Product
data
|
Bank
|
Account
Data
|
Hospital
|
Patient
Data
|
University
|
Student
Data
|
Government
Department
|
Planning
data
|
Informational
Systems
Information
system, an integrated set of
components for collecting, storing, and processing data and for delivering
information, knowledge, and digital products. Business firms and other
organizations rely on information systems to carry out and manage their
operations, interact with their customers and suppliers, and compete in the
marketplace. For instance, corporations use information systems to reach their
potential customers with targeted messages over the Web, to process financial
accounts, and to manage their human resources.
Components
of information systems
The
main components of information systems are computer hardware and software, telecommunications,
databases and data warehouses, human resources, and procedures. The hardware,
software, and telecommunications constitute information technology (IT), which
is now ingrained in the operations and management of organizations.
OLTP
and DSS Systems
OLTP
(Online Transaction Processing)
Online
Transaction processing database applications are optimal for managing changing
data, and usually have a large number of users who will be simultaneously
performing transactions that change real-time data. Although individual
requests by users for data tend to reference few records, many of these
requests are being made at the same time. Common examples of these types of
databases are airline ticketing systems and banking transaction systems.
The
primary concerns in this type of application are concurrency and atomicity.
Concurrency
controls in a database system ensure
that two users cannot change the same data, or that one user cannot change a
piece of data before another user is done with it. For example, if you are
talking to an airline ticket agent to reserve the last available seat on a
flight and the agent begins the process of reserving the seat in your name,
another agent should not be able to tell another passenger that the seat is
available.
Atomicity
ensures that all of the steps
involved in a transaction complete successfully as a group. If any step fails,
no other steps should be completed. For example, a banking transaction may
involve two steps: taking funds out of your checking account and placing them
into your savings account. If the step that removes the funds from your
checking account succeeds, you want to make sure that the funds are placed into
your savings account or put back into your checking account.
Transaction
processing system databases should be designed to promote:
·
Good data
placement.
·
Short
transactions to minimize long-term locks and improve concurrency.
·
Online backup.
·
High
normalization of the database.
·
Little or no
historical or aggregated data
· Careful use of indexes.
DSS
(Decision Support System)
Decision-support
database applications are optimal for data queries that do not change data. For
example, a company can periodically summarize its sales data by date, sales
region, or product and store this information in a separate database to be used
for analysis by senior management.
To
make business decisions, users need to be able to determine trends in sales
quickly by querying the data based on various criteria. However, they do not
need to change this data. The tables in a decision-support database are heavily
indexed, and the raw data is often preprocessed and organized to support the
various types of queries to be used. Because the users are not changing data,
concurrency and atomicity issues are not a concern; the data is changed only by
periodic, bulk updates made during off-hour, low-traffic times in the database.
Decision-support
system databases should be designed to promote:
Heavy
Indexing
Decision-support
systems have low update requirements but large volumes of data. Use many
indexes to improve query performance.
Denormalization
of the database
Introduce
pre-aggregated or summarized data to satisfy common query requirements and
improve query response times.
Use
of a star or snowflake schema to organize the data within the database.
Characteristics
of Data Warehouse
Some data is denormalized for simplification and to
improve performance.
Large amounts of historical data are used.
Queries often retrieve large amounts of data.
Both planned and ad-hoc queries are common.
The
data load is controlled.
Subject
Oriented:
Data
warehouses are designed to help you analyze data. For example, to learn more
about your company's sales data, you can build a warehouse that concentrates on
sales.
Integrated:
Integration is closely
related to subject orientation. Data warehouses must put data from disparate
sources into a consistent format. They must resolve such problems as naming
conflicts and inconsistencies among units of measure.
Nonvolatile
Nonvolatile
means that, once entered into the warehouse, data should not change. This is
logical because the purpose of a warehouse is to enable you to analyze what has
occurred.
Time
Variant
In
order to discover trends in business, analysts need large amounts of data. This
is very much in contrast to online transaction processing (OLTP) systems, where
performance requirements demand that historical data be moved to an archive. A
data warehouse's focus on change over time is what is meant by the term time
variant.
Data
Warehouse software and hardware architecture
In
computing, the term data warehouse appliance (DWA) was coined by Foster
Hinshaw to define a new category of computer architecture for data warehousing
(DW) specifically targeted for Big Data Analytics and Discovery that is (a)
simple to use (not a pre-configuration) and (b) very high performance for this
workload. A DWA includes an integrated set of servers, storage, operating system(s),
and DBMS.
In
marketing, the term has evolved to include pre-installed and pre-optimized
hardware and software as well as similar software-only systems promoted as easy
to install on specific recommended hardware configurations or preconfigured as
a complete system. These are marketing uses of the term and do not reflect the
technical definition.
At
its core, a DWA is designed specifically for high performance big data
analytics and is delivered as an easy-to-use packaged solution. The internal software
(and often hardware) constructs of a DWA differ significantly from a
traditional stack in that they are written for a target workload and not a
generic general purpose workload.
The
data warehouse appliance (DWA) has several characteristics which differentiate
that architecture from similar machines in a data center, such as an enterprise
data warehouse (EDW).
1. A DWA has a very tight integration of its internal
components which are optimized for "data-centric" operations in
contrast to "compute-centric" operations. The latter tend to
emphasize number of CPU's, cores and network bandwidth.
2. A DWA is trivial to use and install. In contrast to a
"pre-configuration" of components, a DWA has very few configuration
switches or options.
3. A DWA is optimized for analytics on Big Data.
Most DW appliances use
massively parallel processing (MPP) architectures to provide high query
performance and platform scalability. MPP architectures consist of independent
processors or servers executing in parallel.
Comments
Post a Comment