The ORM "Problem"

Wednesday - March 25, 2009

The ORM "Problem"

Here's an interesting point of view. This blog post (from a few years ago) likens ORM to the quagmire of political dogma and stupidity that lead us to the Vietnam war. The Vietnam of Computer Science.

I'll skip the complex historical analogy and cut to the key points.

ORM is not a quagmire of unsolvable problems. It's generally straightforward.

The theoretical ideals of perfect fidelity between the object and relational representations of information is hard in theory but easy in practice. This is because of some false expectations for ways in which relational databases can be used. "In theory there is no difference between theory and practice. But, in practice, there is." Relational databases are used in limited ways, making ORM work quite well.

This posting lifts up a bunch of stuff that are not real issues. In and among all the non-problems are two real concerns.

Impedance Mismatch

There is an object-relational impedance mismatch. Technically relational database is a bunch of disconnected sets of very simple rows of very simple datatypes. The relationships among sets -- technically -- is a very casual foreign-key thing. As a practical matter, the foreign-key relationships are not completely casual.

As a mathematical formalism, FK's can refer to anything -- PK's or attributes. This colors the analysis of ORM by making the mapping appear to be an arbitrary constraint on the relational model. Because of this, it appears that one model dominates and the other is ignored.

Claiming that the relational model is ignored when we focus on objects is incorrect. The theoretical relational model (anything can join with anything) is ignored. But we don't actually use this theoretically open-ended model. We formally declare foreign key relationships, limiting our relational model to an object-like subset of all possible models.

What we actually lose are the rich semantics of the various object collection types. But this is not much of a loss, since we replace them with a relational store that (usually) gives us an efficient key-to-set-of-rows mapping.

[Note, I was once told not to use the Impedance Mismatch phrase. It's a solved problem in electrical engineering, I was told, just use a transformer. And I agreed; "It's solved in computer science also, just use ORM." But it's still a problem and it still requires a tiny bit of design to choose the right transformer.]

Object-to-Table Mapping

Technically, it's the class-to-table mapping or object-to-row mapping.

What's important is that this is -- perhaps -- the hardest part of ORM. Class definitions often form hierarchies that are just difficult to represent in a relational database.

When you think about it, the problem is a weakness in the relational world, and you have to make some concessions to the limitations of the relational database to reflect class hierarchies and disjoint (non-polymorphic) subclasses.

Happily, most ORM's handle inheritance nicely. Django's ORM handles proper inheritance. SQLAlchemy requires some care in doing the mapping between table and class definition, but this allows you to choose among the alternative relational implementations of inheritance.

Associations

The relational association is limited to 1:m foreign keys and nothing more. Some relational database designers forget that it's unidirectional. A child has the FK to a parent; the parent has no notion of children. A relational join resolves the references, but it's done through a search.

The object association is a reference to an object that -- similarly -- is unidirectional. The issue is that is can be in either direction: a child can have a reference to a parent or the parent can have a collection of references to children.

This is not a larger or more complex problem. Relatively simple methods can provide appropriate OO functionality in spite of the limited relational implementation. Further, a little bit of caching in the ORM layer goes a long way to providing pure OO speed in spite of the underlying RDBMS requiring search to populate the cache.

What about changes to the database while things are sitting in cache? This is just as big a problem in a pure SQL world as it is in an ORM world. And it has the same solution -- keep your transactions short and handle deadlock intelligently. We'll return to this, since it's a real problem.

Politics (Who Owns the Schema)

The political shoving between DBA's and developers is nothing new, and not an indictment of ORM. Some developers do lousy database designs. Some DBA's insist on "scent-marking" every design by altering small things or adding lots of stored procedures. It's hard to understand both domains.

ORM does not solve or exacerbate the problem.

It requires a careful OO design to get things absolutely right. Then a mapping to a relational schema that provides a good implementation. And the process of mapping may lead to changes in the OO design.

Some OO designs are naive with respect to independent existence of objects, association references, and persistence. Some OODBMSs can have trouble with too-complex object models and mutual references between objects. This is not an ORM issue -- it's plain old design for persistence and recreation of references.

We'll return to this when looking at the "Object Creation" issue, since this is one of two real issues with ORM.

Dual Schema Definitions

This is the second of the two real issues identified in the "Vietnam of Computer Science". Which comes first, relational schema or object schema?

In Django, they solve this neatly by creating the relational schema from the objects. SQLAlchemy can do this also, but it takes a little more coding to create the required table and class definitions. The table is a schema-level definition; the class is defines each row. Note that the two concepts are separate and are conflated by the SQL CREATE TABLE statement.

Data outlives application software; data is widely shared. In a mixed-language environment (some OO, some non-OO) this can be a problem. Non-OO developers may want schema changes that break the ORM mappings.

In a single-language environment, this is never a problem. In order to prevent problems in a mixed-language environment, one approach must be considered "strategic" and the other must adapt. The non-OO programming language must be treated as second-class. Sorry.

The Entity Identity Non-Problem

Yes, the internal object ID (e.g, the memory address) will not match the relational database primary key. However, keeping the relational PK in the object as an attribute makes this "problem" uninteresting.

Data Retrieval and OQL

This is the most common stumbling block when first approaching ORM. Generally, we don't do ad-hoc queries against objects; we simply navigate. We create hashmaps and lists and other collections.

In the relational world, we get so used to search that the SELECT statement takes on mystic significance.

The bridging, however, isn't really that complex. The solution is to introduce either class-level methods or associated meta-objects that handle search for us. In Django, it's a class-level manager object that handles search and creates the "model instances" -- the row objects. In SQLAlchemy, the table definition handles search and this creates the mapped row objects.

The only ORM complexity here is the confusion created by SQL joins. The result of a join is a new class of objects; a class that's not formally defined in the relational schema. In an ORM world, you must explicitly navigate from an instance of one class to an instance of another class. It feels inefficient to do "extra" I/O for navigating from object to object. Somehow, the overhead of the SQL searches to create (and cache) the join results vanish.

SQL joins shouldn't be seen as a first-class feature that's missing from an ORM. A SQL join is a workaround that creates a new, implied class definition. It's a hack. There, I said it.

The "Partial Object Problem"

As with some other "problems", this is a non-problem. SQL allows creation of an on-the-fly class when you select only a subset of available columns. This can be formalized with a view, or done casually with a select other than "SELECT *".

Selecting a subset of columns doesn't make sense when using an ORM. It's not an interesting feature of SQL. Selecting a subset of columns can be called a "bandwidth" reduction, but you'll be hard pressed to measure the impact. More time is spent in the search than in packaging the columns for transmission back to the application. Eliminating relational search should be the goal.

The Object Load Issue or Who Gets Created First?

As mentioned above, a circular relationship is a bit of a problem. Two objects with mutual references are hard to recover from persistent storage because they each have a reference to the other.

With SQL, you don't have a problem because you don't have "direct" references, just FK references. All object retrieval requires search. In the object world (in memory) objects have "direct" references without resorting to search.

In the ORM world, we have to reintroduce search. One object, when used through ORM, cannot have a direct reference to another object. It has to resort to search. The ORM layer must cache the search results, otherwise, each reference among objects results in a database query.

Two Concerns

After all those words, we have two real concerns.

1. Which schema comes first? RDBMS or Objects? And how do we balance the design considerations between the two?

2. How do we handle object navigation so as to prevent excessive RDBMS queries and also prevent an out-of-date object cache?

The first requires skill and experience to create an object model that works and can map to the subset of object features that make sense in an ORM world. This is easy to do in practice; even if it has lots of theoretical issues.

The most important consideration when looking at the two schema is to clearly define the various persistent entities. The discipline imposed by ORM usually improves the design by asking the designer to think carefully about navigation and creation.

The second problem is an inherent problem in all database designs. If your synchronization between processes is done via locking and deadlock detection, you'll always have problems. Either hold locks while users are staring at their screen or use some versioning technique to determine that the data was changed while you were thinking. Holding locks makes this slow for the person waiting. Versioning requires extra database activity to validate the version prior to a change.

ORM did not make this problem significantly worse. It made the problem a little worse by adding another level of caching outside the database. But the locking/versioning problem still exists.

ORM is not a problem. It is not a quagmire. It is the only practical way to leverage the ubiquitous relational database with an OO language.

Posted: Wednesday - March 25, 2009 at 06:01 AM

Author: Steven Lott
Technorati Tags: ORM RDBMS Python
Technorati Cosmos:

Technorati Watchlist:

Add this entry to: