Schema Migration: A Toolkit Approach

Steven F. Lott

slott56@gmail.com

http://slott-softwarearchitect.blogspot.com/

http://www.itmaybeahack.com/book/schema/schema_migration_talk.html

Topics

Problem

Solution

Conclusion

About Schema

Logical schema has formal documentation in SQL.

Where is the meaning described?

Why We Care

Waterfall methods do few schema migrations.

Agile methods do a lot of schema migrations.

Why This is Hard

Managers often treat schema migration as an after-thought.

It's actually quite difficult.

Schema Gaps

The conceptual schema is obscure.

Five common knowledge gaps...

Use Case Incompleteness

What do users really do?

The Closed World Assumption

If it's not in the database, it doesn't exist. Right?

Fundamental Semantic Murkiness

Sales "Customer" vs. Support "Customer" vs. Marketing "Customer"

Specification and Design Process Errors

Stuff Happens

Implementation Details

Is the code really part of the schema?

Bewilderment Factors

SAD Management Practices

Sloppy Anonymous Defaults.

Enough of Why It's Hard

What are we trying to do?

How should we approach it?

Typical Schema Evolutions

Changes to a Schema include...

In various combinations

Schema Evolution

Some Schema changes are not simple mechanical changes.

They can go way beyond Create, Drop and Alter.

For example, Table Normalization

  • Old table split to two (or more) new tables.
  • May involve complex SELECT DISTINCT query.
  • May involve creating many-to-many association table.

Difficult to generalize.
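
A minimal sketch of the extract side of one such normalization, assuming a denormalized OLD_ORDERS table; every table, column, and file name here is illustrative:

import csv
import some_database
from contextlib import closing

def extract_normalized():
    # SELECT DISTINCT yields candidate rows for a new CUSTOMER table; the
    # second query yields rows for a new ORDERS table that will reference
    # it through a key generated during load.
    customer_query= """SELECT DISTINCT CUST_NAME, CUST_ADDR
    FROM OLD_ORDERS"""
    order_query= """SELECT ORDER_ID, CUST_NAME, ORDER_DATE
    FROM OLD_ORDERS"""
    db2= some_database.connect( ... )
    for query, filename in (
            (customer_query, "data/customer.txt"),
            (order_query, "data/orders.txt") ):
        with open(filename, "w") as target:
            wtr= csv.writer( target )
            with closing( db2.cursor() ) as c:
                c.execute( query )
                wtr.writerow( [ col[0] for col in c.description ] )
                wtr.writerows( c.fetchall() )

The load side must then generate the customer keys and look them up while loading the orders, which is exactly why this case resists a generic tool.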

Schema Evolutions, Part II

Wait.

The Schema is Incomplete. The meaning is obscure.

Or totally unknown.

What are we thinking?

We can't automate something we can't represent.

Can we?

Schema Migration Tools

Without the conceptual schema, a migration tool is an attractive nuisance.

What We Can Do

Six Disciplines

Toolkit Approach

Testability

Six Disciplines

Release Identification

No vague "Production". A specific numbered release.

Enable conversations about other releases and other purposes

Data Stewardship

No vague "Production". A specific owner with a specific business purpose.

Enable conversations about other data and owners

Configuration Control

Use Links, Environment Variables, Config Files.

Not vague "Production".

Allow conversations about other software releases; other data

Why?

Schema Migration can get done many times.

Also.

Production Upgrade

The in-place production upgrade — rebuilding an existing database — is a mistake waiting to happen.

Just don't.

Always think of schema migration as a copy

Accompanied by config changes to point apps to version x+1 DB.
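
One way to honor that, a sketch assuming an environment variable and the fin_work_ release naming used later in this talk; a config file or a symbolic link works the same way:

import os
import some_database

def connect():
    # SCHEMA_RELEASE and the fin_work_ naming are illustrative; the point
    # is that no program hard-codes a vague "Production" database.  The
    # keyword argument depends on your driver.
    release= os.environ.get( "SCHEMA_RELEASE", "3" )
    return some_database.connect( database="fin_work_"+release )

At cutover, only the environment variable (or config file, or link) changes to name the version x+1 database.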

Toolkit Approach

Data Movement Strategies

Script Organization and Naming

Populate Package

Validate Package

Create and Finalize

Data Movement Strategies

No "best" answer.

ETL is usually faster, more flexible, more auditable, and more recoverable.

How Do I Do It?

Glad you asked.

Script Organization and Naming

Five standard Python packages:

  1. prevalidate. Confirm old-schema data can be moved. Just Enough.
  2. create. Create new schema from scratch. All the DDL. All.
  3. populate. Extract-Transform-Load.
  4. validate. Confirm new schema data was moved.
  5. finalize. Build indexes, views; add constraints.

__main__.py to run the whole sequence.

May run validate periodically as data quality check.
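
On disk, that organization looks something like this; the module names inside each package are illustrative:

__main__.py          (runs the whole sequence)
prevalidate/
    __init__.py
    __main__.py
    balance.py
create/
    __init__.py
    __main__.py
populate/
    __init__.py
    __main__.py
    some_table.py
validate/
    __init__.py
    __main__.py
    cond1.py
finalize/
    __init__.py
    __main__.py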

Python Package

python3.2 -m somename

python2.7 -m somename.__main__

Overall __main__.py Script

import prevalidate.__main__
import create.__main__
import populate.__main__
import validate.__main__
import finalize.__main__
def main():
    prevalidate.__main__.main()
    create.__main__.main()
    populate.__main__.main()
    validate.__main__.main()
    finalize.__main__.main()
if __name__ == "__main__":
    main()

Overall __main__.py Script (alt)

import runpy
def main():
    for s in 'prevalidate', 'create', 'populate', \
    'validate', 'finalize':
        runpy.run_module(
            s+'.__main__',
            run_name="__main__", alter_sys=True)
if __name__ == "__main__":
    main()

Details

We'll look at populate first, since it's the most important.

prevalidate and validate second, because they're somewhat easier.

create and finalize last, since they're easiest. They're mostly SQL.

Populate Package

A bunch of Populate modules.

Populate Simple Case

Adding or removing columns is the easy case.
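
When the old and new databases happen to sit on the same server, the easy case can even be a single INSERT ... SELECT; this is a sketch with illustrative names (app_3, app_4, SOME_TABLE), not part of the standard modules:

import some_database
from contextlib import closing

def main():
    # Assumes one server holds both releases; the added column C simply
    # gets a constant default while the old columns are copied.
    stmt= """INSERT INTO app_4.SOME_TABLE (A, B, C)
    SELECT A, B, 0 FROM app_3.SOME_TABLE"""
    db= some_database.connect( ... )
    with closing( db.cursor() ) as c:
        c.execute( stmt )
    db.commit()

if __name__ == "__main__":
    main()

Otherwise it is the same extract-and-load shown next, with a trivial transform.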

Populate Complexities

A "table normalize" creates n tables from one.

A "table denormalize" may create one table from many.

Populate Module

from __future__ import print_function
import csv
import some_database
from contextlib import closing

def extract(): ...
def load(): ...

def main():
    extract()
    load()
if __name__ == "__main__":
    main()

def extract():

query="""SELECT M-1 AS M1, Y+2000 AS Y2, Z
FROM SOME_TABLE"""
db2= some_database.connect( ... )
with open("data/file1.txt", "w") as target:
    wtr= csv.writer( target )
    with closing( db2.cursor() ) as c:
        c.execute( query )
        headings = [ col[0] for col in c.description ]
        wtr.writerow( headings )
        wtr.writerows( c.fetchall() )

Query and filename can be parameters.
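
For example, a parameterized version, reusing the imports from the module skeleton above; the signature is a suggestion, not the original code:

def extract( query, filename ):
    # Same body as above, with the SELECT and the target file passed in
    # so one function serves many tables.
    db2= some_database.connect( ... )
    with open(filename, "w") as target:
        wtr= csv.writer( target )
        with closing( db2.cursor() ) as c:
            c.execute( query )
            wtr.writerow( [ col[0] for col in c.description ] )
            wtr.writerows( c.fetchall() )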

def load():

stmt= """INSERT INTO SOME_TABLE(MY,Z) VALUES(%d,%s)"""
db3= some_database.connect( ... )
with open("data/file1.txt", "r") as source:
    rdr= csv.DictReader( source )
    with closing( db3.cursor() ) as c:
        for row in rdr:
            c.execute( stmt,
            [int(row['M1'])*12+int(row['Y2']), row['Z']] )
    db3.commit()

The new-value calculation is difficult to parameterize.
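
One way around that is to pass the calculation in as a function; a sketch, with my_table_transform as an illustrative name:

def load( stmt, filename, transform ):
    # transform maps one CSV row (a dict of strings) to the parameter
    # list for this particular table's INSERT statement.
    db3= some_database.connect( ... )
    with open(filename, "r") as source:
        rdr= csv.DictReader( source )
        with closing( db3.cursor() ) as c:
            for row in rdr:
                c.execute( stmt, transform(row) )
        db3.commit()

def my_table_transform( row ):
    # The M1/Y2 calculation from the previous slide, kept beside the
    # INSERT statement it belongs to.
    return [ int(row['M1'])*12+int(row['Y2']), row['Z'] ]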

Populate Review

Collection of ETL modules

Each module has extract(), load(), and main() functions.

Overall __main__ to build the whole database

Validate Package

A bunch of validate modules.

  • Typically independent conditions.
  • Each condition module has a main() function.
  • This function raises an exception when a row is bad.

A __main__.py runs them all at the same time.

  • Use a multiprocessing.Pool to farm out the workload.

One Exception ?!?!

Yes. One. You want to know whether the migration worked.

This is not a comprehensive data quality audit.

If you need a data quality audit (or data profiling), that's a separate effort; the logging variation shown later is the starting point.

Validate __main__.py, Part 1

from __future__ import print_function
import multiprocessing
import sys

def validate( func ):
    """Wrap the exception-raising rule function."""
    try:
        func()
        return True
    except Exception as e:
        print( e )
        return False

Validate __main__.py, Part 2

import cond1, cond2, cond3

def main():
    rules = cond1, cond2, cond3
    vpool= multiprocessing.Pool(processes=len(rules))
    results= vpool.map( validate, (r.main for r in rules) )
    print( list(zip( rules, results )) )
    if not all(results):
        sys.exit( 2 )

if __name__ == "__main__":
    main()

Simple Condition Module

import some_database
from contextlib import closing

def condition( row ): ...

def main():
    db= some_database.connect( ... )
    with closing(db.cursor()) as c:
        c.execute( "SELECT... FROM..." )
        for row in c.fetchall():
            assert condition(row), "FAIL {0!r}".format(row)
if __name__ == "__main__":
    main()

Data Quality Audit

Small change to replace assert...

for row in c.fetchall():
    if not condition(row):
        log.error( "FAIL {0!r}".format(row) )

Before and After Balance Checks

Two Parts: Old Schema and New Schema

In the prevalidate package, for the old schema: compute the expected counts and serialize them.

In the validate package, for the new schema: read the expected counts and compare them with the actual counts.

Pre-Validate Balance main()

query="""SELECT COUNT(*) AS SOME_TABLE_COUNT
FROM OLD_TABLE"""
db= some_database.connect( ... )
with open("data_3/some_table.json", "w" ) as output:
    with closing(db.cursor()) as c:
        c.execute( query )
        headings= [ col[0] for col in c.description ]
        expected= dict( zip(headings, c.fetchone()) )
    json.dump( expected, output )

This serializes a dictionary of counts.

Validate Balance main()

query= """SELECT COUNT(*) AS T1_COUNT
FROM NEW_TABLE"""
db= some_database.connect( ... )
with closing(db.cursor()) as c:
    c.execute( query )
    headings= [ col[0] for col in c.description ]
    actual= dict( zip(headings, c.fetchone()) )
with open("data_3/some_table.json", "r" ) as input:
    expected= json.load( input )

assert actual['T1_COUNT'] == expected['SOME_TABLE_COUNT']

Confirm the actual counts match expected in the serialized dictionary.

Validate and PreValidate Review

Collection of condition modules

Each module has a main() function that raises an exception when data is bad.

Modules can all be run independently

Overall __main__ to run all validations concurrently

Create and Finalize

Execute DDL scripts.

Nothing more than that.

Command-Line Tool

from __future__ import print_function
import sys
import subprocess
def main():
    # The mysql client reads the SQL script from stdin.
    with open( "script.sql" ) as script:
        status = subprocess.call(
            ["mysql", "--user=...", "--password=...", "app_3"],
            stdin=script )
    if status != 0:
        sys.exit( status )
if __name__ == "__main__":
    main()

Execute DDL Script 1 of 2

import some_database
from contextlib import closing
def statement_iter( aFile ): ...
def main():
    db= some_database.connect( ... )
    with open( "script.sql", "r" ) as script:
        for stmt in statement_iter( script ):
            with closing(db.cursor()) as c:
                c.execute( stmt )

if __name__ == "__main__":
    main()

Execute DDL Script 2 of 2

def statement_iter( aFile ):
    buffer= []
    for line in (l.strip() for l in aFile):
        if line.endswith(';'):
            buffer.append( line[:-1] )
            yield "\n".join( buffer )
            buffer= []
        else:
            buffer.append( line )
    if buffer:
        yield "\n".join( buffer )

May need to skip \g and other non-SQL.
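
A variation that does that, skipping blank lines, -- comments, and a bare \g; extend the skip test for whatever else your client's scripts contain:

def statement_iter( aFile ):
    buffer= []
    for line in (l.strip() for l in aFile):
        # Skip blank lines, SQL comments, and client-only commands.
        if not line or line.startswith('--') or line == '\\g':
            continue
        if line.endswith(';'):
            buffer.append( line[:-1] )
            yield "\n".join( buffer )
            buffer= []
        else:
            buffer.append( line )
    if buffer:
        yield "\n".join( buffer )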

Testability

How do we test schema migration?

Compare fin_migrate_3.2 with fin_work_4.1

What About Unit Testing?

Of course.

Each individual validate or populate module must be unit tested.

These tests often seem trivial and silly.

However. The real-world data in the working databases is hard to foresee.

Unit testing may expand into data profiling and data quality audits.

Test Strategy

  1. Clone the database you're going to migrate. Copy fin_work_# to fin_mtest_# (or fin_tmigr_#); see the sketch after this list.
  2. Run the prevalidate package on cloned data until you're sure it works. This will also produce balances for use by validate.
  3. Test create and populate packages. Save the database from populate for other testing.
  4. Test validate on new database until you're sure it works.
  5. Test finalize on the new database. This is low risk, mostly SQL.

Validation can get ugly; stick with it until it's right.
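
For step 1, one way to make the clone, assuming the same MySQL command-line tools as earlier; the database and file names are illustrative, and the target database must already exist:

from __future__ import print_function
import subprocess
import sys

def main():
    # Dump fin_work_3 to a file, then reload it into fin_mtest_3.
    with open("fin_work_3.sql", "w") as dump:
        status = subprocess.call(
            ["mysqldump", "--user=...", "--password=...", "fin_work_3"],
            stdout=dump )
    if status == 0:
        with open("fin_work_3.sql", "r") as dump:
            status = subprocess.call(
                ["mysql", "--user=...", "--password=...", "fin_mtest_3"],
                stdin=dump )
    if status != 0:
        sys.exit( status )

if __name__ == "__main__":
    main()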

What About?

Cloning a database for testing is slow, expensive, complex and requires a lot of storage.

Making a mistake in schema migration is slower, more expensive, more complex. But it requires less storage.

Conclusion

Why is this hard?

Three reasons.

What can we do?

Six Disciplines

Toolkit Approach

Why is this hard?

Conceptual Schema Gaps.

Bewilderment: Bygone Applications, Brand-New Applications, Bullfeathers, Buffoonery, Bad Management and Bugs.

Sloppy Anonymous Defaults: vague "Production" label.

What Can We Do?

Six Disciplines

Five Step Toolkit

Python is the framework. Batteries Included.

Questions?