| date: | 2008-04-27 12:50:40 |
|---|---|
| category: | Data Structures and Algorithms |
It’s not that spreadsheets are evil incarnate, it’s just that there are so many ways to abuse them. Spreadsheets put a veneer of structure over some information. Bad spreadsheet design, worse yet, puts unstructured information into a hard-to-manipulate format.
I’ve complained about this before. See Great Quotes about the Spreadsheet Problemâ„¢ .
Compounding the problem is the casual way that some folks deal with the dimensionality of their data. I’ve started working with researchers. They know the distinction between independent and dependent variables – it’s often part of their experiment design. However, their spreadsheet is just a flat list of columns, concealing the meaning in their data.
There are three hard parts to these kinds of problems:
The design of the database. DB design isn’t always easy to deduce from the spreadsheets. Other programs can (and often do) exacerbate the problem by claiming that a dimensionally normalized database isn’t to helpful – they just need persistent storage. In the long run, that won’t work out.
Inconsistency of the spreadsheets. We can write rules, pass laws, build macros, rant and rave. The spreadsheet is still a flexible desktop tool. We can try to replace it with MS-Access, but the security model is still all-or-nothing. If you have any access, you can add, change or delete anything you want to touch.
Getting data out of an XLS file. There are many, many ways to approach this problem. We might:
Enter XLRD
It turns out that the Open Office organization has managed to reverse engineer enough of the mystery that is XLS files. This has two important consequences: the Apache POI project and the Python xlrd package.
With XLRD I can read the XLS file directly, cutting down on the user’s busy-work to save the file in a usable format. Using Python allows me to work through the consistency problem. This usually leads to a class hierarchy which handles variations on the expected spreadsheet.
A bonus is that we can unify XLRD-based readers with XML parsers to deal with a wide variety of spreadsheet data sources.
We’re still left with the remaining two hard parts – DB design and consistency. With Python we can easily build something meta-data driven. In particular, we can use Python’s introspection capabilities to have flexible high-level mappings from various kinds of spreadsheets to the RDBMS tables.