Data Warehousing — Lecture 1
Dimensional Modelling Introduction for Data Warehousing — Part 1
Dimensional modeling is the preferred technique for presenting analytic data in data warehousing and business intelligence (DW/BI) systems. This method is highly regarded because it satisfies two key requirements:
- Understandability for Business Users: Dimensional models are intuitive and easy to comprehend, which facilitates their adoption and use by business users.
- Fast Query Performance: The structure of dimensional models supports efficient query processing, ensuring quick retrieval of data.
The Essence of Dimensional Modeling
Dimensional modeling simplifies databases to align with the human need for simplicity. Over the past five decades, this approach has been embraced by IT organizations, consultants, and business users for its clarity and efficiency. A simple model not only aids user comprehension but also allows software to navigate and process data efficiently.
Imagine a business described as selling products in various markets and measuring performance over time. This scenario lends itself to a dimensional structure where data is visualized as a cube with dimensions like product, market, and time. Each point within the cube represents a combination of these dimensions, storing measurements such as sales volume or profit.
Albert Einstein’s philosophy, “Make everything as simple as possible, but not simpler,”
encapsulates the guiding principle of dimensional modeling. A model that begins with simplicity has a better chance of remaining manageable and performing well.
Dimensional Models vs. Normalized Models
While dimensional models are often implemented in relational databases, they differ significantly from third normal form (3NF) models.
- Normalized Models (3NF): These models aim to eliminate data redundancies by dividing data into many discrete entities, each becoming a relational table. This results in complex structures, often resembling spider webs, which are difficult for users to navigate and understand.
- Dimensional Models: These models focus on understandability and performance by organizing data into a star schema or an OLAP (online analytical processing) cube, which is more intuitive and efficient for querying.
Star Schemas and OLAP Cubes
- Star Schemas: Implemented in relational databases, star schemas have a central fact table connected to surrounding dimension tables, resembling a star.
- OLAP Cubes: Implemented in multidimensional database environments, OLAP cubes store data and indices designed for dimensional data, offering advanced query performance and analytical capabilities. They allow users to drill down or roll up data efficiently without issuing new queries.
Key Components of a Dimensional Model
Fact Tables
Fact tables store the performance measurements resulting from business process events. Each row in a fact table corresponds to a specific measurement event, such as a sales transaction, and is characterized by a grain, which defines the level of detail (e.g., one row per product sold).
Characteristics of Fact Tables
- Additive Facts: Numeric and additive facts (e.g., sales dollars) are the most useful because they can be summed across various dimensions.
- Grain: All measurement rows in a fact table must be at the same grain to avoid double-counting and ensure consistency.
- Foreign Keys: Fact tables have foreign keys that connect to primary keys in dimension tables, ensuring referential integrity.
Dimension Tables
Dimension tables provide the textual context for the measurements in fact tables. They describe the “who, what, where, when, how, and why” of the business events.
Characteristics of Dimension Tables
- Attributes: Dimension tables contain many attributes (columns) that serve as the primary source of query constraints, groupings, and report labels. These attributes should be descriptive and user-friendly.
- Hierarchies: Dimension tables often represent hierarchical relationships (e.g., products roll up into brands and then into categories), which are stored redundantly for simplicity and performance.
- Denormalization: Dimension tables are typically demoralized to include hierarchical information, avoiding the complexity of snowflaking (normalizing dimensions into separate tables).
Benefits of Dimensional Modeling
- Simplicity: Dimensional models are straightforward, making them easy for business users to understand and navigate.
- Performance: The structure of dimensional models allows database optimizers to process queries efficiently.
- Extensibility: Dimensional models can gracefully accommodate changes, such as adding new dimensions or facts, without requiring a complete redesign or data reload.
Dimensional modeling leverages the complementary nature of fact and dimension tables, providing a robust framework for DW/BI systems that balances simplicity, performance, and flexibility.
Granularity in Dimensional Models
Granularity refers to the level of detail represented by the fact table. It is essential to define the granularity of your fact tables carefully. The grain must be clear and consistent across all facts within the table to avoid double-counting or misinterpretation of the data. For example, if the fact table’s grain is defined as one row per product sold per transaction, every fact (e.g., Sales Dollars, Sales Units) must be recorded at that same level of detail.
Types of Fact Tables
There are three primary types of fact tables: transaction, periodic snapshot, and accumulating snapshot.
- Transaction Fact Tables: These are the most common and record individual transaction events. Each row in the table corresponds to a single transaction, such as a sale or an order.
- Periodic Snapshot Fact Tables: These capture the state of certain aspects of the business at regular intervals (e.g., daily, weekly). This is useful for tracking metrics like daily inventory levels or monthly account balances.
- Accumulating Snapshot Fact Tables: These track the progress of a process over time, such as an order fulfillment process, capturing key milestones within a single row. This type of fact table is less common and is typically used for processes with well-defined stages.
Designing Dimension Tables
When designing dimension tables, it is important to include descriptive attributes that add meaningful context to the data. These attributes should be:
- Verbose: Use full descriptive text rather than codes or abbreviations.
- Business-Friendly: Ensure that the terminology and structure are understandable to business users.
- Hierarchical: Capture hierarchical relationships within the dimension to enable drill-down analysis (e.g., product category, brand, SKU).
Dimension tables can sometimes contain hierarchical data, which might be normalized in operational databases but should be demoralized for analytical purposes in dimensional models. This denormalization, known as “flattening,” enhances query performance and simplicity.
Star Schema Versus Snowflake Schema
A star schema is characterized by a central fact table surrounded by demoralized dimension tables, creating a star-like structure. This simplicity makes star schemas easy to understand and use for business users and optimized for query performance.
A snowflake schema, on the other hand, normalizes the dimension tables, resulting in additional layers of tables and more complex joins. While normalization reduces redundancy and can save space, it generally complicates the schema and may hinder query performance. Therefore, star schemas are preferred for most DW/BI applications.
Summary
Dimensional modeling is a powerful technique for designing databases that are both easy for business users to understand and capable of delivering fast query performance. By focusing on simplicity and using demoralized star schemas, you can create a DW/BI system that is both robust and flexible. Fact tables store the measurable data, while dimension tables provide the necessary descriptive context, making the data accessible and useful for analysis.
Next, we will delve into more detailed design considerations for fact and dimension tables, ensuring that your DW/BI system is both effective and efficient.