Versioning Databases – Branching and Merging

Monday, February 4, 2008

Previous entries:

  1. Three rules for Database Work
  2. The Baseline
  3. Change Scripts
  4. Views, Stored Procedures and the Like

If you are not familiar with branching and merging, I'd suggest Eric Sink's excellent Source Control HOWTO. My preferred branching practice is "branch for release". New development work takes place in the mainline, or the trunk of the repository. As a product nears release, schema changes become rare and code churn slows. At this point, the team creates a branch for the release. New feature development resumes in the trunk with schema changes and all. The branch receives only bug fixes. The team generally merges fixes from the branch into the trunk.

How do schema changes work in this scenario?

In The Words Of Run-D.M.C.

It's tricky.

Schema changes in a branch require some care and thought. The technique I use of forward-only, run-once schema change scripts has some issues when it comes to branches. This is certainly an area you'll want to think about when devising your schema management process.

In my case, I bet that schema changes in a branch will be rare. Like most things in our field, it's all about the tradeoffs. If you use a more aggressive branching strategy (like branch per feature), this approach doesn't work so well. Other strategies (like Ruby Migrations) can move forward or backward through changes, as long as you don't make irreversible changes.
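The forward-only, run-once idea from earlier in this series can be sketched in a few lines. This is a minimal illustration using Python and SQLite; the SchemaChangeLog table name comes from the series, but the column names and helper functions here are my own assumptions, not the actual tool:

```python
import sqlite3

def applied_scripts(conn):
    """Return the set of change script versions already recorded."""
    conn.execute("""CREATE TABLE IF NOT EXISTS SchemaChangeLog
                    (version TEXT PRIMARY KEY, applied_on TEXT)""")
    return {row[0] for row in conn.execute("SELECT version FROM SchemaChangeLog")}

def apply_change_scripts(conn, scripts):
    """scripts maps version -> SQL. Run each pending script once, in order."""
    done = applied_scripts(conn)
    for version in sorted(scripts):           # zero-padded versions sort correctly
        if version in done:
            continue                          # run-once: never re-apply
        conn.executescript(scripts[version])  # forward-only: no "down" scripts
        conn.execute("INSERT INTO SchemaChangeLog VALUES (?, datetime('now'))",
                     (version,))
    conn.commit()

conn = sqlite3.connect(":memory:")
apply_change_scripts(conn, {
    "01.00.0001": "CREATE TABLE Customer (Id INTEGER PRIMARY KEY, Name TEXT);",
    "01.00.0002": "ALTER TABLE Customer ADD COLUMN Email TEXT;",
})
```

Because the log is checked before each script runs, re-running the whole set against an up-to-date database is a harmless no-op, which is exactly what makes branching the interesting case.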

The Scenario

Let's say the team baselined the database at version 01.00.0000, and wrote 45 change scripts during feature development (01.00.0001 to 01.00.0045). Feature development is done, and the product is near a 1.0 release, so a 1.0 branch is created (note: the schema version and database version don't need to coincide at all).

At the point of branching, I create a new baseline for the database. In this case, a baseline of 01.01.0000 or 02.00.0000 is fine - I just want to rev either the major or minor version number. Let's say we use 02.00.0000 for the purpose of discussion. This new baseline goes into the trunk. All new database installs (from the trunk) just run this new baseline script, as it should produce the same database as the original baseline + the 45 change scripts. I also check in an empty 02.00.0000 schema change script so that all existing databases, when updating to a new build on this code, are now "version 2" databases.
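At update time, the tooling only has to figure out which scripts in the build are newer than the database's current version. A rough sketch of that decision (the helper names here are hypothetical, and a real tool may compare against the change log instead of a single version number):

```python
def parse_version(v):
    """'02.00.0000' -> (2, 0, 0) so versions compare numerically."""
    major, minor, build = v.split(".")
    return (int(major), int(minor), int(build))

def scripts_to_run(current_version, available):
    """available: sorted list of change script versions shipped in the build.
    Return only the scripts newer than the database's current version."""
    cur = parse_version(current_version)
    return [v for v in available if parse_version(v) > cur]

# An existing 01.00.0045 database picking up the new trunk build:
pending = scripts_to_run("01.00.0045", ["02.00.0000", "02.00.0001", "02.00.0002"])
```

The empty 02.00.0000 script runs first for existing databases, stamping them as "version 2", while a fresh install skips the whole list and just runs the 02.00.0000 baseline.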

Now, imagine the team working ahead and creating 2 schema change scripts in the trunk. These changes are 02.00.0001 and 02.00.0002. At this point a bug is found in the branch, and the bug requires a schema change to fix. Ugh!

Back in the branch, the team creates schema change 01.00.0046 and fixes the bug with a combination of code and change script. All is well for production-type systems that are only receiving stable builds, because those databases have never seen a v2.0 change script. We simply update those systems with the new v1.0 build. The branch build includes and applies the 46th change script. All is well, at least in the world of 1.0.

What About the Mainline?

To get this fix into the mainline, there are two options. Well, actually there are an infinite number of options depending on how you apply your updates, but here are two:

  1. Merge the schema change script into the mainline as 01.00.0046, and fix the 2.0 baseline script to incorporate this change.
  2. Write a new schema change script, 02.00.0003, that has the same changes as change 46 in the branch.

With option #1 you have to be careful because any database that updated to v2.0 will not take the 46th change script from the branch (unless you write your tools differently than I do). You have to force people to run this script manually, or you go around destroying any existing v2.0 databases (which at this point should only be on development and test machines anyway). This is not a great option, but if you are not deep into 2.0 it is sometimes viable.

Option #2 is a bit friendlier. The v1.0 databases will pick up the fix from 01.00.0046. The v2.0 databases will pick up the fix from 02.00.0003. You have to be careful, though, to write the 02.00.0003 change script so that it won't try to reapply changes if the 01.00.0046 script already ran.

In other words, databases installed from the v2.0 baseline script need to apply the 02.00.0003 script, but production-type databases that have been around since 1.0 will use the 01.00.0046 script, and you don't want 02.00.0003 to create an error by making changes that are already in place when the database eventually updates to v2.0.
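One way to write 02.00.0003 safely is to guard it so it checks whether the branch fix is already in place before changing anything. Suppose, purely for illustration, that the 01.00.0046 fix added a Status column to an Orders table; a guarded version of the trunk script might look like this sketch (Python/SQLite standing in for a guarded T-SQL script):

```python
import sqlite3

def column_exists(conn, table, column):
    """True if the table already has the column (SQLite's catalog query)."""
    return any(row[1] == column
               for row in conn.execute(f"PRAGMA table_info({table})"))

def change_02_00_0003(conn):
    """Same fix as branch script 01.00.0046, but safe to run either way."""
    if not column_exists(conn, "Orders", "Status"):
        conn.execute("ALTER TABLE Orders ADD COLUMN Status TEXT")
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (Id INTEGER PRIMARY KEY)")

change_02_00_0003(conn)  # a v2.0 database that never saw 01.00.0046: applies fix
change_02_00_0003(conn)  # a database that already has the fix: quietly no-ops
```

The same pattern works in SQL Server by checking the system catalog views before the ALTER statement.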

Whew, I hope all that makes sense. Branches and schema changes always make me nervous, but fortunately, they are rare. Even when they do occur, the change scripts usually involve simple changes and not the kind of big changes you see in a script written during feature development.

Questions

In a previous post, Jason asked:

  1. When you add a column to a table, it really needs to be scripted as an update script for existing databases and also changed in the "create entire database" script so that deploying a new database doesn't require the entire history of update scripts to be applied too. This synchronisation is tricky to enforce.
  2. Personally, I am very comfortable writing a change script to rename an existing column, change its type from varchar to int, and maintain all the existing data too. Not everyone on the team is. Most auto-scripting tools won't get these more complex changes right. Delegating one guy on the team as the DB guy isn't good when he's on leave or otherwise unavailable.

#1: I generally only baseline between releases. This means new installs might have to run a heap of update scripts, but since they are new installs and the process is automated – it's not a big deal. The benefit is I don't have to keep baseline scripts synchronized, which is one less chore to perform when updating the schema. When creating a baseline, it's easy to use a SQL diff tool to make sure the baseline is identical to a database created by running the previous baseline plus all the update scripts.
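That cross-check can even be automated with something as simple as comparing the two schemas directly. Here's a toy sketch in Python/SQLite standing in for a real SQL diff tool (the `schema` helper is my own, and a real tool compares much more than column names and types):

```python
import sqlite3

def schema(conn):
    """{table name: [(column name, declared type), ...]} for every table."""
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
    return {t: [(c[1], c[2]) for c in conn.execute(f"PRAGMA table_info({t})")]
            for t in tables}

old = sqlite3.connect(":memory:")   # previous baseline + all update scripts
old.executescript("CREATE TABLE Customer (Id INTEGER PRIMARY KEY);")
old.executescript("ALTER TABLE Customer ADD COLUMN Email TEXT;")

new = sqlite3.connect(":memory:")   # candidate new baseline, written by hand
new.executescript("CREATE TABLE Customer (Id INTEGER PRIMARY KEY, Email TEXT);")

# Before the new baseline ships, the two should describe the same schema.
```

If the comparison fails, either the new baseline script or one of the update scripts has drifted, and it's far cheaper to find out now than after the branch.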

#2: The change scripts many tools build often don't work for me because of the size of the databases I work with. Some of these scripts will run for hours and hours. Most of the change scripts are hand written with loving care. It's great to have a few developers around who are good with SQL – I know not every team has this luxury, and I honestly don't have a good answer to that problem.

I Think It's Over ... For Now

I've glossed over many details, but I think this is all I have to say on the matter of database versioning for now. Hopefully, these short blog posts made some sense.

The goal of versioning a database is to push out changes in a consistent, controlled manner and to build reproducible software. Since the change scripts are all version controlled and labeled with build numbers, you can create any past version of the database that you need.

My goal in writing these posts was to get people who don't version control their database to think about starting. If you already version control your database, I hope these posts have validated your thinking, or given you ideas. It could be that you already have a much better process in place, in which case you should try to blog about your process, because there is a dearth of information on the Internet about this topic. Better yet – write an article or a book, but hurry up, we need to hear from you!


Comments
Maeghith Monday, February 4, 2008
I'm talking off the top of my head, which means I haven't tested anything I'll say, but I'll say it anyway.

This has been a great series but I think the process of versioning the updates and the cumbersome branching mechanism you described is because you are trying to override what the versioning tool is intended to do (incremental diffs and patching/branching).

The procedure coming to my head right now is to put the create table in a file, baseline that, and then append all the update statements (registering in VC when appropriate) to that same file. Then the data to store in the SchemaChangeLog table, instead of the filename, will be the revision number (or change hash if that's the case) where the update was added to the file. Then to roll back or integrate branches you'll use the procedures set by your VCS. But then I'm not sure how this will affect the rest of the procedure you described, because I haven't really thought about the consequences.

Hmm... another possibility is to wrap each update in a stored procedure that checks if the update has already been applied before committing it to the database. So you'll need to assign a unique ID (maybe a hash or GUID) to each update and check for it, which means a little more metadata in the SchemaChangeLog table (I'll have to take a look at patch theory because I think that looks quite similar to what git, darcs, bazaar-ng, or mercurial do :) ).
chris Monday, February 4, 2008
We solved the problem of syncing change scripts (migrations) across branches by moving all of that code out to an svn:external which is shared amongst all devs on all branches, with all change scripts tied to a ticket number (where every ticket is a branch). This works for us, but could be improved.
scott Monday, February 4, 2008
@Maeghith:

I like the first approach, but I'm afraid it might not be that easy. What happens if I append an FKEY constraint in TABLEA that depends on an column being added to TABLEB first?

It seems like it would be difficult to get the order of change operations applied correctly.
Werner Tuesday, February 5, 2008
Similar principles but a different approach you can find in my blog in jottings #3 and #10.
Maeghith Tuesday, February 5, 2008
@scott:

Well, as the script is usually executed sequentially (just like the sequence of files you had), the way to resolve that issue will be similar: write first what has to be applied first; as everything gets appended to the file, the order will be kept (well, actually with numbered files you could write things in any order; if the names are right, the writing order won't matter).

Anyway, on the other hand, that was the idea until just now, when I realized the resulting script is only useful for clean deployments of the database (what was I thinking when I wrote that comment?): as everything is in a single file, applying the create table and following updates to a database that already has the table/changes applied will (usually) fail, and the script will (usually) stop.
Alain Tuesday, February 5, 2008
@Maeghith:

I think you're on to something. Having only 1 file and letting the VCS control the versioning makes sense to me. Maybe the revision # can be written to the sql script so that the sql script looks at the SchemaChangeLog table and skip already applied sections instead of having the custom tool do it. Just thinking out loud.

This is a great blog!
Pradeep Tuesday, March 4, 2008
Hi Scott,

Thanks for this informative blog.

I want to understand why you have created two baselines of the database. Shouldn't baselines be created once?

thanks.
Joshua Zeidner Wednesday, August 4, 2010
great article, I deal with these issues constantly.
Marty Thursday, November 4, 2010
I prefer to use sortable dates for the change number rather than M.m.R.B numbers... makes things a bit easier when merging changes between multiple developers.
by K. Scott Allen