Data Warehouse Design Patterns

Steven F. Lott, Consultant

1. Objectives and Background

  • Discuss patterns (briefly)
  • Two detailed examples
    • Warehouse
    • Background Query Processing
  • Eight summary examples

Background

A successful Data Warehouse implementation, acromym is “ACE”

  • Physical Size
  • Usage
  • Performance
  • History
Physical Size

Central fact table: over 120 million records

Activity on 6 million financial accounts

Size is about 400Gb

  • We use Oracle/RDB compression to keep it small

Complex Snowflake schema

Usage

About fifty users

Average of 400 queries per week, retrieving 2.1 million rows.

Bursts of 600 queries per week fetching 10.6 million rows.

All queries ad-hoc

  • No aggregation in advance!
Performance

Average performance per query

  • 8500 rows each
  • 26 minutes each

Average elapsed time per row returned

  • 0.9 seconds each
History

Internal statistics warehouse

  • Application Managament
  • Over 118 weeks of data
  • Over 46,000 queries
  • all tables and columns referenced by all queries ever run in production

2. Patterns and Pattern Language

Provide a name to standardize discussion

Describe a problem

Describe a solution

Enumerate the consequences of design decisions

Why use them...

Why Use Patterns?

Provide a common framework for describing problems and solutions

Easy to catalog successful approaches

Easy to make incremental improvements

Pattern Language includes a template for describing a pattern

Pattern Language

Intent - purpose of the pattern,

Aliases - other names,

Motivation - background and justification for this approach,

Applicability - circumstances for use,

Structure - object model diagram,

Participants - summary of classes involved,

Collaborations - clients of and services to,

Consequences - benefits and liabilities,

Implementation - issues to be solved as part of the implementation

Related Patterns - other patterns which are similar to, incorporated in or which would incorporate this pattern,

References - my own experience

Sample Code - skipped

Known Uses - ACE warehouse

3. Warehouse Example Patterns

Data Warehouse

Star Schema

Snowflake Schema

Warehouse GUI

Background Query Processing

Warehouse Load Utility

Warehouse Statistics

Restricted Update Interface

Metadata Repository

Data Mining

Data Warehouse

Intent

  • want to keep history of facts on completed business transactions
  • no updates – only insert and retrieval

Aliases

  • OLAP, Decision Support System, Executive Information System

Motivation

  • Decision support different from day-to-day business operations.
  • Historical analysis different from decision support.
  • Relational database performance degrades with the size of the database.
  • Contrast warehouse (On-Line Analytical Processing–OLAP) vs. operational (On-Line Transaction Processing–OLTP)

Applicability. Use a Data Warehouse when:

  • Keeping historical data
  • Retaining accurate, uniform history.

Structure

  • Data Warehouse is an aggregate of one or more Star Schemas in the role of User-DB and Maint-DB.
  • User-DB is a copy of Maint-DB, provided to users to keep availability close to 24x7.

Attributes

  • two operating states: user retrieval or maintenance - a number of products, i.e. data from operational systems. Each product has state with date of most recent load - meta-data with source and transformations

Operations

  • Fact Load - Fact Proof - Dimension Maintenance - Clone. We clone the maintenance database to create a user database - User Query

Participants

  • warehouse is only an aggregate of star-schemae - user GUI is the most visible component - a background query processor permits users to do other things on their desktop (MS-Windows limitation) - load utilities do transformations; they may automate dimension maintenance - log everything in an administrative database - trusted user interface for database loads; no informal developer access

Collaborations

  • fed from operational systems - feeds data marts - user GUI may be a collaborator instead of a participant. May be outside the warehouse – user-selected tools - metadata maintenance may be collaborator or participant; often meta-data kept separate from warehouse (we use excel spread-sheet!)

Consequences

  • remove (expensive) history from operational databases - warehouse data must be final. What about adjustments? - to get availability to 24x7 clone maintenance DB to make user DB. Our database can be cloned in 4 hours

Implementation. Solutions will be required for all pattern elements

  • Star Schema - User GUI - Background Query Processing - Administrative DB - Trusted User Interface - Fact Load - Warehouse Metadata

Related Patterns

  • Operational Database
  • A Warehouse GUI
  • Background Query Processing
  • Warehouse Load Utilities
  • Warehouse Statistics
  • Warehouse Operator Interface

Star Schema

Problem

  • permit slicing and dicing by various keys

Solution

  • prejoin numeric facts with all relevant dimensions

Consequences

  • complex prejoined load

Snowflake Schema

Problem

  • dimensions are complex

Solution

  • code dimensions with surrogate key
  • dimension table has attributes

Consequences

  • complex dimension maintenance during loads

Warehouse GUI

Problem

  • support end-user ad-hoc query

Solution

  • show warehouse state, construct queries, submit for processing, collect results

Consequences

  • repository for query construction, background query processor

Background Query Processing

Intent

  • How to remove the processing load from the user desktop.
  • Permit compression of results before sending back to the desktop.

Motivation

  • user desktop limitations:
    • may not be multi-tasking
    • may not be reliable enough for long-running queries
    • may not be secure enough to leave unattended
  • DB server may not have enough performance to do long-running queries
  • network connection may not be suitable for large query results

Applicability. Use a Background Query Processor when:

  • desktop unsuitable for long-running queries

Structure

  • a queue of SQL requests
  • a query executor; creates results, reports status
  • a results manager; compress, cache and transmit

Participants

  • queue of SQL requests (SQL and start time) states: waiting, ready-to-run, finished.
  • query executor gets SQL, establishes a DB connection, runs the SQL query, collects status.
  • results manager compresses, caches and transmits results.

Collaborations

  • queue fed from warehouse GUI
  • status feed to warehouse GUI
  • results compressed and transferred to file server to users workstation

Consequences

  • protocol to enqueue a query; we insert SQL text into a table which represents the query queue.
  • protocol to fetch results; we use PKZip and FTP
  • protocol to show status; we put status into rows of a table
  • background control will be large and complex
  • multiple subtypes of background executions
    • ad-hoc vs. canned production
    • different result formats (depends on desktop tools)

Implementation

  • pick a mechanism to enqueue queries
  • pick a mechansim to execute the query; we use a sleep loop to poll for work
  • pick a mechanism to reply with status
  • pick a mechanism to send back results; our GUI is Business Objects; we reverse engineered their mechanism so we could replace it with PKZIP and FTP transfers

Related Patterns

  • Star Schema will be queried by background processes running on a query server.
  • Warehouse GUI enqueues queries, checks status and retrieves results.
  • record all queries and all performance-related data in the Warehouse Statistics DB.

Warehouse Load Utility

Problem

  • Move normalized data to denormalized star or snowflake schema

Solution

  • Standardized load utilities

Consequences

  • Unique subclass for each legacy system

Warehouse Statistics

Problem

  • Analyze performance and usage

Solution

  • Keep load and query processing statistics

Consequences

  • Need meta warehouse for statistics
    • Need to parse ad-hoc SQL

Restricted Update Interface

Problem

  • must assure quality of warehouse data

Solution

  • controlled, trusted, production update process
  • user proof and sign-off

Consequences

  • no developer or user write access to DB

Metadata Repository

Problem

  • what is meaning and provenance of data?
  • maintenance of extracts, loads, etc.

Solution

  • repository of meta data to describe source, business rules, and warehouse location

Consequences

  • must be kept up-to-date and used for all maintenance

Data Mining

Problem

  • users don’t have a useful business model

Solution

  • look for correlations among facts

Consequences

  • haphazard queries

4. Conclusion

Patterns capture knowledge

Patterns ease learning

Patterns facilitate incremental improvement

Warehouse is built from standard patterns

  • Star Schema
  • Warehouse GUI
  • Load Utilities
  • etc.

Pick and choose to make your own implementation

Patterns Leverage Experience

  • Focus on elements
  • Make improvements
  • Not overwhelmed by whole architecture
version:1
date:1999