Versioning Databases – Branching and Merging

Monday, February 4, 2008

Previous entries:

If you are not familiar with branching and merging, I'd suggest Eric Sink's excellent Source Control HOWTO. My preferred branching practice is "branch for release". New development work takes place in the mainline, or the trunk of the repository. As a product nears release, schema changes become rare and code churn slows. At this point, the team creates a branch for the release. New feature development resumes in the trunk with schema changes and all. The branch receives only bug fixes. The team generally merges fixes from the branch into the trunk.

How do schema changes work in this scenario?

In The Words Of Run-D.M.C.

It's tricky.

Schema changes in a branch require some care and thought. The technique I use of forward-only, run--once schema change scripts has some issues when it comes to branches. This is certainly an area you'll want to think about when devising your schema management process.

In my case, I bet that schema changes in a branch will be rare. Like most things in our field, it's all about the tradeoffs. If you use a more aggressive branching strategy (like branch per feature), this approach doesn't work so well. Other strategies (like Ruby Migrations) can move forward or backward with changes (as you don't make irreversible changes).

The Scenario

Let's say the team baselined the database at version 01.00.0000, and wrote 45 change scripts during feature development (01.00.0001 to 01.00.0045). Feature development is done, and the product is near a 1.0 release, so a 1.0 branch is created (note: the schema version and database version don't need to coincide at all).

At the point of branching, I create a new baseline for the database. In this case, a baseline of 01.01.0000 or 02.00.0000 is fine - I just want to rev either the major or minor version number. Let's say we use 02.00.0000 for the purpose of discussion. This new baseline goes into the trunk. All new database installs (from the trunk) just run this new baseline script, as it should produce the same database as the original baseline + the 45 change scripts. I also check in an empty 02.00.0000 schema change script so that all existing databases, when updating to a new build on this code, are now "version 2" databases.

Now, imagine the team working ahead and creating 2 schema change scripts in the trunk. These changes are 02.00.0001 and 02.00.0002. At this point a bug is found in the branch, and the bug requires a schema change to fix. Ugh!

Back in the branch, the team creates schema change 01.00.0046, and fixes the bug with a combination of code and change script. All is well for production type systems that are only receiving stable builds, because those databases have never seen a v2.0 change script. We simply update those systems with the new v1.0 build. The branch build includes and applies the 46^th change script. All is well, at least in the world of 1.0.

What About the Mainline?

To get this fix into the mainline, there are two options. Well, actually there are an infinite number of options to consider depending on how you apply your updates, but here are two options:

Merge the schema change script into the mainline as 01.00.0046, and fix the 2.0 baseline script to incorporate this change.
Write a new schema change script, 02.00.0003, that has the same changes as change 46 in the branch.

With option #1 you have to be careful because any database that updated to v2.0 will not take the 46^th change script from the branch (unless you write your tools differently than I do). You have to force people to run this script manually, or you go around destroying any existing v2.0 databases (which at this point. should only be on development and test machines anyway). This is not a great option, but if you are not deep into 2.0 it is sometimes viable.

Option #2 is a bit friendlier. The v1.0 databases will pick up the fix from 01.00.0046. The v2.0 databases will pick up the fix from 02.00.0003. You have to be careful though, to write the 02.00.0003 change script so that it won't try to reapply changes if the 01.00.0046 script ran.

In other words, databases installed from the v2.0 baseline script need to apply the 02.00.0003 script, but production type databases that have been around since 1.0 will use the 01.00.0046 script, and you don't want 02.00.0003 to create an error by making changes that are already in place when the database eventually updated to v2.0.

Whew, I hope all that makes sense. Branches and schema changes always make me nervous, but fortunately, they are rare. Even when they do occur, the change scripts usually involve simple changes and not the kind of big changes you see in a script written during feature development.

Questions

In a previous post, Jason asked:

When you add a column to a table, it really needs to be scripted as an update script for existing databases and also changed in the "create entire database" script so that deploying a new database doesn't require the entire history of update scripts to be applied too. This synchronisation is tricky to enforce.
Personally, I am very comfortable writing a change script to rename an existing column, change it from varchar to int also, and maintain all the existing data too. Not everyone on the team is. Most auto-scripting tools won't get these more complex changes right. Delegating one guy on the team as the DB guy isn't good when he's on leave or otherwise unavailable.

#1: I generally only baseline between releases. This means new installs might have to run a heap of update scripts, but since they are new installs and the process is automated – it's not a big deal. The benefit is I don't have to keep baseline scripts synchronized, which is one less chore to perform when updating the schema. When creating a baseline, it's easy to use a SQL diff tool to make sure the baseline is identical to a database created by running the previous baseline plus all the update scripts.

#2: The change scripts many tools build often don't work for me because of the size of the databases I work with. Some of these scripts will run for hours and hours. Most of the change scripts are hand written with loving care. It's great to have a few developers around who are good with SQL – I know not every team has this luxury, and I honestly don't have a good answer to that problem.

I Think It's Over ... For Now

I've glossed over many details, but I think this is all I have to say on the matter of database versioning for now. Hopefully, these short blog posts made some sense.

The goal of versioning a database is to push out changes in a consistent, controlled manner and to build reproducible software. Since the change scripts are all version controlled and labeled with build numbers, you can create any past version of the database that you need.

My goal in writing these posts was to get people who don't version control their database to think about starting. If you already version control your database, I hope these posts have validated your thinking, or given you ideas. It could be that you already have a much better process in place, in which case you should try to blog about your process, because there is a dearth of information on the Internet about this topic. Better yet – write an article or a book, but hurry up, we need to hear from you!

All My Pluralsight Courses

OdeToCode by K. Scott Allen