Data Warehouse to Data Mart

Steven F. Lott

Agenda

  • Background
  • The Evolving Business Case
  • Expanding into a Datamart
  • Results

1. Background

A successful Data Warehouse implementation

  • Physical Size
  • Usage
  • Performance
  • History of Actual Use

Physical Size

  • Central fact table: over 170 million records
  • Activity on 6 million financial accounts
  • Effective size is about 400 GB
    • We use Oracle/RDB compression to keep it smaller than this
  • Complex Snowflake schema
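A "complex snowflake schema" means the dimensions themselves are normalized into chains of sub-tables, so a typical query walks from the central fact table out through several joins. A minimal sketch, using SQLite as a stand-in and entirely hypothetical table names (the real schema is far larger):

```python
import sqlite3

# Miniature snowflake: fact table -> account -> product -> product_type.
# All names here are illustrative, not the actual warehouse schema.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE product_type (type_id INTEGER PRIMARY KEY, type_name TEXT);
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT,
                      type_id INTEGER REFERENCES product_type);
CREATE TABLE account (account_id INTEGER PRIMARY KEY,
                      product_id INTEGER REFERENCES product);
CREATE TABLE activity_fact (account_id INTEGER REFERENCES account,
                            month TEXT, outstanding_debt REAL);
""")
db.executemany("INSERT INTO product_type VALUES (?, ?)",
               [(1, "Mortgage"), (2, "Credit Card")])
db.executemany("INSERT INTO product VALUES (?, ?, ?)",
               [(10, "30yr Fixed", 1), (20, "Gold Card", 2)])
db.executemany("INSERT INTO account VALUES (?, ?)",
               [(100, 10), (101, 20)])
db.executemany("INSERT INTO activity_fact VALUES (?, ?, ?)",
               [(100, "1998-01", 95000.0), (101, "1998-01", 1200.0)])

# A typical ad-hoc query must traverse the whole normalized chain.
rows = db.execute("""
    SELECT pt.type_name, SUM(f.outstanding_debt)
    FROM activity_fact f
    JOIN account a       ON a.account_id = f.account_id
    JOIN product p       ON p.product_id = a.product_id
    JOIN product_type pt ON pt.type_id   = p.type_id
    GROUP BY pt.type_name
    ORDER BY pt.type_name
""").fetchall()
print(rows)  # [('Credit Card', 1200.0), ('Mortgage', 95000.0)]
```

The join depth is the point: at 170 million fact rows, every extra hop in the snowflake adds cost to every ad-hoc query.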

Usage

  • About fifty users
  • Average of 400 queries per week, retrieving 2.1 million rows.
  • Bursts of 600 queries per week fetching 10.6 million rows.
  • All queries ad-hoc
    • No aggregation in advance!

Performance

  • Average performance per query
    • 8500 rows each
    • 26 minutes each
  • Average elapsed time per row returned
    • 0.9 seconds each

History of Actual Use

  • Internal statistics warehouse

    • “Application Management” discipline
    • Over 118 weeks of data
    • Over 46,000 queries
    • All tables and all columns referenced by all queries ever run in production
    • Weekly merge from admin into stats DB
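The statistics warehouse described above accumulates which tables and columns every production query touched. A hedged sketch of the weekly admin-to-stats merge, assuming each query's references arrive as (table, column) pairs (the real feed format is not specified in the talk):

```python
from collections import Counter

# Hypothetical weekly merge: production emits the (table, column) pairs
# each query referenced; the stats DB accumulates lifetime usage counts.
def merge_week(stats: Counter, weekly_queries) -> Counter:
    for query_refs in weekly_queries:
        stats.update(query_refs)  # count each reference
    return stats

stats = Counter()
week1 = [
    [("activity_fact", "outstanding_debt"), ("account", "account_id")],
    [("activity_fact", "outstanding_debt")],
]
merge_week(stats, week1)
print(stats[("activity_fact", "outstanding_debt")])  # 2
```

Counts like these are what made the later datamart analysis possible: they show which columns 46,000 real queries actually used.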

2. Evolving Business Case

  • Preliminary Analysis
  • Detailed Analysis
  • Current Business Use
  • Open Issues
  • Future Business Use

Preliminary Analysis

  • Four broad categories of financial products

    • Mortgages, Installment Loans, Credit Cards and Cash Reserve accounts
  • About 12 operational systems

  • User Requirements:

    • “Yeah, we might need that, include it, also.”

Detailed Analysis

  • Over 1000 data items

  • Fairly simple set of measures

    • Outstanding debt

    • Recent Payments

  • Complex set of dimensions

    • 100’s of attributes

      • Some well understood, many very murky

Users & Operational Systems

  • Don’t insulate the users from operational details

  • Remember your purpose

  • Data → Information → Decision → Action

  • The warehouse doesn’t provide reports!

    • It supports actionable decisions

Users & Operational Systems

  • Provide a way to feed operational special cases through the warehouse

  • Operational systems are full of unused and misused features

    • No one has 3 Christmas Club accounts
    • So the xmas_club_3 field is now used to track accounts created due to live-check promotions
    • This may be graceless, but it’s a fact
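One way to feed such special cases through the warehouse is an explicit cleansing rule in the load, so the repurposed field becomes a named dimension value instead of desktop folklore. A minimal sketch; the field and value names are illustrative, not the real feed layout:

```python
# Hypothetical load-time rule: the operational xmas_club_3 flag has been
# repurposed to mark live-check promotion accounts, so the warehouse
# maps it to an explicit "account origin" dimension value.
def account_origin(record: dict) -> str:
    if record.get("xmas_club_3"):       # misused operational flag
        return "LIVE_CHECK_PROMO"       # what it actually means today
    return "STANDARD"

print(account_origin({"xmas_club_3": 1}))  # LIVE_CHECK_PROMO
print(account_origin({}))                  # STANDARD
```

The graceless operational fact stays graceless, but it is now documented in exactly one place rather than rediscovered by every analyst.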

Bottom Line: Metadata

  • You will need very detailed metadata

  • A simple (filename, byte offset, domain) map is inadequate

  • You must capture detailed semantics

  • You must guard against the evil of

    • Semantic Heterogeneity
    • “I call them customers, you call them accounts”
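What "more than a (filename, byte offset, domain) map" might look like in practice: a metadata record that also carries the business meaning and the synonyms that cause semantic heterogeneity. The fields below are an illustrative sketch, not an actual metadata standard:

```python
from dataclasses import dataclass, field

# Hypothetical metadata record: the first three fields are the classic
# physical map; the rest capture the detailed semantics the talk calls for.
@dataclass
class ColumnMetadata:
    source_file: str
    offset: int
    domain: str
    business_name: str                  # what the users call it
    definition: str                     # precise business meaning
    synonyms: list = field(default_factory=list)  # guards against
                                        # "customers" vs "accounts"

col = ColumnMetadata(
    source_file="ACCT_MASTER", offset=42, domain="NUMERIC(12)",
    business_name="Account",
    definition="A single financial product held by one customer",
    synonyms=["customer", "relationship"],
)
print(col.business_name)  # Account
```

Recording the synonyms explicitly is the cheap insurance: when two departments use different words for one thing, the metadata says so before a query goes wrong.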

Current Business Use

  • Query results put into Excel spreadsheets
  • Query results put into Access databases
  • Additional dimension information included via Access
  • New breakdowns invented on the desktop
  • Large datasets downloaded to users

Open Issues

  • Match between dimension definitions in warehouse and on desktop (in MS-Access)
  • Long-running warehouse queries (large result sets)
  • Long, complex desktop processing
  • Desktop maintenance of history
  • Multiple copies of (nearly) identical queries

Future Use

  • User access to dimension information

    • No need to keep private dimension facts
    • No need for complex desktop processing
    • No need for large result sets
  • Production queries

    • Reduce (nearly) identical queries
    • Additional time-domain data

3. Expanding into a Datamart

  • What’s Wrong?
  • Why is a Datamart a Fix?
  • Preliminary Analysis
  • Detailed Analysis
  • Implementation

What’s Wrong?

  • The warehouse is large
  • The warehouse is complex
  • The warehouse lacks business focus
  • Users have created a desktop nightmare to overcome these problems
  • We must provide a better (simpler, faster, more accurate, more flexible) solution

Why a Datamart?

  • Warehouse has multiple users

  • Warehouse has a "generic" business focus

  • Warehouse works correctly

    • It isn’t broke, but it isn’t working
    • Datamart is a business-focused subset of the warehouse

Preliminary Analysis

  • 118 reports produced by one department

  • Some reports are so complex only key individuals produce them

    • And they recently quit!
  • Other reports have interesting discrepancies

  • Some reports include manually fed data

  • Some reports include adjustments

Detailed Analysis

  • About 24 queries produce all reports

  • Complex reports involve non-standard processing

    • non-star-schema joins
    • additional dimension information
  • 4 new (simple) aggregates required

    • Some new dimensions
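The four new aggregates are precomputed rollups of the detail fact table, built during the load so reports query a small summary instead of the 170-million-row detail. A sketch of the idea, again using SQLite and hypothetical names:

```python
import sqlite3

# Hypothetical rollup step added to the warehouse load: summarize the
# detail fact table into a small aggregate for the datamart reports.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE activity_fact (product TEXT, month TEXT, amount REAL);
INSERT INTO activity_fact VALUES
  ('Mortgage',    '1998-01', 100.0),
  ('Mortgage',    '1998-01', 250.0),
  ('Credit Card', '1998-01',  40.0);

-- The rollup: detail rows collapse to one row per (product, month).
CREATE TABLE monthly_agg AS
  SELECT product, month, SUM(amount) AS total, COUNT(*) AS n
  FROM activity_fact
  GROUP BY product, month;
""")
rows = db.execute(
    "SELECT product, total FROM monthly_agg ORDER BY product").fetchall()
print(rows)  # [('Credit Card', 40.0), ('Mortgage', 350.0)]
```

The trade-off noted under Implementation applies: the rollup is paid for once per load (24+ hours here, monthly), so every subsequent report query is fast.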

Implementation

  • Add extract and rollup process to warehouse load to produce aggregates

  • Not always the right solution

    • Ours takes 24+ hours, once each month

      • YMMV
  • Add to Business Objects repository

  • Reconcile Dimensions

    • Desktop - Datamart - Warehouse

4. Results

  • Heavy use of datamart objects

  • Able to produce comprehensive reports 24 hours after load

    • With very fast queries against aggregate
  • All Private dimension information integrated into datamart

  • Partially integrated into warehouse

Results

  • User changes to dimensions now visible

    • Fewer mystery totals on reports
  • Use of aggregates standardizes results

  • Fewer rogue analysts in user department

  • Demand for “adjustment” to match G/L data

    • down to the dollar
    • unprecedented accuracy

Bottom Line

  • Warehouse provides bulk data

  • Datamart provides needed business rules

    • Transform data into information

      • Need both to make actionable decisions

      • Decision-making is what a warehouse or datamart supports

        • not reporting
Version 1, 1998