Schema Refactoring with Views

Database triggers are generally pretty awful. Anyone who has had to deal with a database heavily laden with DML triggers knows this and avoids using them. If anyone tells me that they must use a trigger, I can always prove that there’s a better way that doesn’t bury dependencies and business logic so deeply in the database. However, there’s one special case where I find triggers helpful.

Refactoring a schema to support new work is seemingly impossible when you must also serve legacy applications that cannot be easily changed. Views provide an interesting abstraction that can give you the flexibility you need in this case. Acting like façades over the table structures evolving beneath them, views may use instead-of triggers to translate the data back and forth between the old and new formats, making it seem to the legacy applications that the database hasn’t changed. Let me start with a simple Address table to demonstrate the idea.
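The exact shape of the table isn’t important for the demonstration; a minimal sketch like the following will do, with the [Street] and [City] columns included purely for illustration:

CREATE TABLE [dbo].[Address]
(
    [AddressID] INT IDENTITY(1,1) NOT NULL,
    [Street]    NVARCHAR(100)     NOT NULL,
    [City]      NVARCHAR(50)      NOT NULL,
    [Latitude]  FLOAT             NOT NULL,   -- stored as separate columns
    [Longitude] FLOAT             NOT NULL,   -- in the legacy design
    CONSTRAINT [PK_Address] PRIMARY KEY CLUSTERED ([AddressID])
);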

Now suppose that we need to make two changes to this table without disrupting legacy applications that depend on the Address table remaining defined as it has been for some time. First of all, to separate the subject areas of the database better, we’ve decided that we must move the Address table from the [dbo] schema into a schema named [geo]. Next, the individual [Latitude] and [Longitude] attributes must be replaced with a single column of type GEOGRAPHY to support some enhanced geographic processing functions being added for new applications. The new Address table will be created like this:
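A sketch along these lines, carrying over the illustrative columns from above; the [Location] column name is my own choice:

CREATE SCHEMA [geo];
GO

CREATE TABLE [geo].[Address]
(
    [AddressID] INT IDENTITY(1,1) NOT NULL,
    [Street]    NVARCHAR(100)     NOT NULL,
    [City]      NVARCHAR(50)      NOT NULL,
    [Location]  GEOGRAPHY         NOT NULL,   -- replaces [Latitude] and [Longitude]
    CONSTRAINT [PK_geo_Address] PRIMARY KEY CLUSTERED ([AddressID])
);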

Moving the data into the new table is straightforward using the GEOGRAPHY::Point function:
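Roughly like this, assuming the illustrative columns above and SRID 4326 (WGS 84):

SET IDENTITY_INSERT [geo].[Address] ON;

INSERT INTO [geo].[Address] ([AddressID], [Street], [City], [Location])
SELECT  [AddressID],
        [Street],
        [City],
        GEOGRAPHY::Point([Latitude], [Longitude], 4326)   -- 4326 = WGS 84
FROM    [dbo].[Address];

SET IDENTITY_INSERT [geo].[Address] OFF;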

After fixing up any foreign key references from the old Address table to the new one, we’re ready to drop the old table and create a new view in its place:
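Something along these lines, using the GEOGRAPHY type’s Lat and Long properties to reconstruct the old columns:

DROP TABLE [dbo].[Address];
GO

CREATE VIEW [dbo].[Address]
AS
SELECT  [AddressID],
        [Street],
        [City],
        [Location].Lat  AS [Latitude],    -- derived from the GEOGRAPHY value
        [Location].Long AS [Longitude]    -- derived from the GEOGRAPHY value
FROM    [geo].[Address];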

The new view is named exactly like the dropped Address table so SELECT queries work just as they did before:
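For example, a query like this one (the WHERE clause is just for illustration):

SELECT  [AddressID], [Street], [City], [Latitude], [Longitude]
FROM    [dbo].[Address]
WHERE   [City] = N'Richmond';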

Now let’s try to insert some data into the view:
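An insert along these lines, with values I’ve made up for illustration:

INSERT INTO [dbo].[Address] ([Street], [City], [Latitude], [Longitude])
VALUES (N'123 Example Street', N'Richmond', 37.5407, -77.4360);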

The insert fails with the message:

Update or insert of view or function ‘dbo.Address’ failed because it contains a derived or constant field.

The problem is that the view is based on some logic for parsing out the latitude and longitude parts from a GEOGRAPHY type. The view doesn’t know how to convert the individual latitude and longitude components back into a GEOGRAPHY, so let’s provide an instead-of trigger to do that:
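Here’s a sketch of such a trigger, again assuming the illustrative columns used above:

CREATE TRIGGER [dbo].[Address_InsteadOfInsert]
ON [dbo].[Address]
INSTEAD OF INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- Convert the legacy latitude/longitude pair back into a GEOGRAPHY point
    INSERT INTO [geo].[Address] ([Street], [City], [Location])
    SELECT  [Street],
            [City],
            GEOGRAPHY::Point([Latitude], [Longitude], 4326)
    FROM    inserted;
END;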

With the INSTEAD OF INSERT (IOI) trigger in place, the insert statement we tried before now works. We should also add INSTEAD OF UPDATE (IOU) and INSTEAD OF DELETE (IOD) triggers to the view to make sure those operations continue to work for legacy applications:
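Sketches of those two triggers might look like this, assuming as before that the primary key value itself is never modified:

CREATE TRIGGER [dbo].[Address_InsteadOfUpdate]
ON [dbo].[Address]
INSTEAD OF UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- Rebuild the GEOGRAPHY point from the updated latitude and longitude
    UPDATE  a
    SET     a.[Street]   = i.[Street],
            a.[City]     = i.[City],
            a.[Location] = GEOGRAPHY::Point(i.[Latitude], i.[Longitude], 4326)
    FROM    [geo].[Address] AS a
    INNER JOIN inserted AS i ON i.[AddressID] = a.[AddressID];
END;
GO

CREATE TRIGGER [dbo].[Address_InsteadOfDelete]
ON [dbo].[Address]
INSTEAD OF DELETE
AS
BEGIN
    SET NOCOUNT ON;

    DELETE  a
    FROM    [geo].[Address] AS a
    INNER JOIN deleted AS d ON d.[AddressID] = a.[AddressID];
END;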

With those triggers in place, the following statements work as we had hoped:
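For example, statements like these (the values are illustrative):

UPDATE  [dbo].[Address]
SET     [Latitude]  = 37.5410,
        [Longitude] = -77.4350
WHERE   [AddressID] = 1;

DELETE FROM [dbo].[Address]
WHERE   [AddressID] = 1;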

In closing, I’ll admit that this pattern has some potential problems that you may need to address. If you’re using an Object-Relational Mapping (ORM) tool that dynamically inspects and validates metadata in the database, it may get confused by the use of a view where it once found a real table. What the Microsoft Entity Framework (EF) refers to as Navigation Properties, representing the foreign key relationships between tables, may also break under this pattern. In addition, the UPDATE trigger as currently implemented offers no way to change the primary key value. That’s certainly possible using the [deleted] row set provided to the trigger, but since modifying surrogate primary keys isn’t commonly expected or allowed, I didn’t provide that more complex implementation. Lastly, as your evolving database design drifts further and further from what the legacy applications expect, the views and their triggers will become harder and harder to maintain. My Schema Refactoring Pattern, as I call it, is best employed when you have a firm date in hand by which the old schema can be deprecated. Triggers are still evil, so you should have a solid plan when you begin the refactoring process to stop using them as soon as possible.

On Primary Key Names

If you use frameworks like Microsoft Azure Mobile Services or Ruby on Rails, then you’re accustomed to complying with a host of development conventions. Frameworks are often said to be opinionated, forcing certain design decisions on the developers who use them. Given that very few software design choices are perfect in every situation, the value of having an opinion is often more about consistency than it is about correctness.

“A foolish consistency is the hobgoblin of little minds, adored by little statesmen and philosophers and divines. With consistency a great soul has simply nothing to do.” — Ralph Waldo Emerson, “Self-Reliance,” Essays: First Series

Emerson’s famous quote about the potential perils of standardization is sometimes misapplied. For example, I once attended a seminar by a software vendor where the speaker referred to Ruby on Rails developers as hobgoblins because of their unswerving reliance on programming conventions. Yet those who understand the Ruby on Rails framework understand that it is loaded with touchstones that lead us to exhibit good and helpful behaviors most of the time. The seminar speaker’s broad derision of the Rails framework was based either on his misunderstanding or on an intent to misdirect the hapless audience for some commercial gain. In his famous essay, Emerson singles out only those friendly yet troublesome creatures of habit that lead to folly as the ones to be categorically avoided.

“One man’s justice is another’s injustice; one man’s beauty another’s ugliness; one man’s wisdom another’s folly.” — Ralph Waldo Emerson

Yet it is true now and again that the conventions expressed in Rails, Azure Mobile Services, Django, CakePHP and many other frameworks can lead to unfortunate consequences, whether through the tools’ misapplication or because of circumstances that simply could not be foreseen by the frameworks’ designers. Nowhere is this more true than in the area of data access. A data pattern that the framework designer considers beautiful in many situations may repulse some database administrators in practice. Application frameworks are often developed and sold for so-called greenfield solutions, those not naturally constrained by prior design decisions in the database and elsewhere. However, many real-world implementations of application frameworks are of the brownfield variety, mired in the muck and the gnarled, organic growth of the thousands of messy technical decisions that came before. In environments like that, the framework designer’s attempt to enforce one kind of wisdom may prove to be quite foolish.

Primary keys are one of the concerns that application frameworks tend to be opinionated about. Ruby on Rails, Microsoft Azure Mobile Services and Django all name their surrogate primary keys id by default, meaning identifier or identity. In fact, the word identity is based on the Latin word id, which means it or that one. So in these frameworks, when you use the primary key to identify a specific record, you’re sort of saying “I mean that one.”

Database developers and administrators often argue that relational databases aren’t so-called object databases and that naming primary keys the same for all tables leads to confusion and errors in scripting. It’s true that when you read the SQL code in a database that uses id for all the primary key names, it can be a bit confusing. Developers must typically use longer, more meaningful table aliases in their queries to make them understood. Ironically, when the application frameworks that desire uniformity in primary key names generate database queries dynamically, they often emit short table aliases or ones that have little or no relationship to the names of the tables they represent. Have you ever tried to analyze a complex query at runtime that has been written by an Object-Relational Mapping (ORM) tool? It can be positively maddening precisely because the table aliases typically bear no resemblance to the names of the tables they express.

Another problem with having all the primary keys named the same is that it sometimes inhibits other kinds of useful conventions. For example, MySQL supports a highly expressive feature in its JOIN syntax that allows you to write code like this:

SELECT * FROM `Order` INNER JOIN LineItem USING (OrderID);

In this case, because the Order table’s primary key is named the same as the LineItem table’s foreign key to orders, the USING clause makes it really simple to connect the two tables. One has to admit that’s a very natural-feeling expression. The aforementioned application frameworks’ fondness for naming all primary keys id makes this sort of practice impossible.

Now that I’ve spent some time besmirching the popular application frameworks for the way they name primary keys, let me defend them a bit. As I said in the beginning, when it comes to frameworks, their opinions are oftentimes more about consistency than objective or even circumstantial correctness. For application developers working in languages like Ruby or C#, having all the primary keys named similarly gives the database a more object-oriented feel. When data models have members that are named consistently from one object to the next, it feels as though the backing database is somewhat object-oriented, with some sort of base table that has common elements in it. If such conventions are reliable, all sorts of time-saving and confusion-banishing practices can be established.

Having done a lot of data architecture work in my career and an equal amount of application development work, my opinion is that naming all database primary keys the same has more benefits than drawbacks across the ecosystem. My opinion is based on the belief that application developers tend to make more mistakes in their interpretations of data than database people do. I believe this is true because, as stewards of information, database developers and administrators live and breathe data as their core job function, while Ruby and C# developers use data as just one of many facets that they manage in building applications. Of course, this is the sort of argument where everyone is correct and no one is, so I’ll not try to claim that my opinion is authoritative. I’m interested in hearing your thoughts on the subject.

Earth Surface Distance in T-SQL

A few years ago, I was working on a project where I needed to calculate the distance between pairs of points on the Earth’s surface. With a bit of research, I found an implementation of the haversine formula written in Python on John D. Cook’s Standalone Numerical Code site. Given two pairs of latitude and longitude values, John’s code produces a unit coefficient that can be multiplied by the radius of the sphere to yield the distance in whatever unit of measurement you might want.

Having the Python code was great, but I needed a version of the algorithm in F#. After searching for some time and not finding an F# implementation, I decided to write one based on John’s Python version and put it back into the public domain. John published my F# code in his Standalone Numerical Code library for everyone to use freely.

The exercise for today is to write the haversine formula in Transact-SQL. I’ll start by proposing a test. Using two well-known points and Google Earth as the standard, I define the following variables.
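The coordinates below are approximate values for the two cities, chosen by me purely for illustration:

DECLARE @lat1 FLOAT = 37.5407,  @lon1 FLOAT = -77.4360;   -- Richmond, Virginia USA (approximate)
DECLARE @lat2 FLOAT = -23.5505, @lon2 FLOAT = -46.6333;   -- São Paulo, Brazil (approximate)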

Google Earth estimates that these two points, in Richmond, Virginia USA and São Paulo, Brazil, are roughly 4,678 miles or 7,528 kilometers apart. Having no better standard handy, those figures will have to suffice for some testing later on. Next comes the heart of the haversine formula.
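Here’s a sketch of that calculation in Transact-SQL; the intermediate variable names are my own:

-- Convert the inputs to radians
DECLARE @radLat1  FLOAT = RADIANS(@lat1);
DECLARE @radLat2  FLOAT = RADIANS(@lat2);
DECLARE @deltaLat FLOAT = RADIANS(@lat2 - @lat1);
DECLARE @deltaLon FLOAT = RADIANS(@lon2 - @lon1);

-- The haversine of the central angle between the two points
DECLARE @a FLOAT =
    POWER(SIN(@deltaLat / 2.0), 2) +
    COS(@radLat1) * COS(@radLat2) * POWER(SIN(@deltaLon / 2.0), 2);

-- The central angle in radians, i.e. the distance on a sphere of radius one
DECLARE @arcUnitDistance FLOAT = 2.0 * ATN2(SQRT(@a), SQRT(1.0 - @a));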

I recommend saving this as a user-defined function in your SQL repository called something like ArcUnitDistance. The reason for that name is that the value this calculation produces is the distance between the two points on a surface with a radius of one, in whatever unit of measure you’re going to apply. Now it’s time to put the code to the test.

Commonly used values for the radius of the Earth when using the haversine formula are 3,960 miles or 6,373 kilometers. Of course, the Earth isn’t a sphere; it’s a spheroid, so you may find other radius values that you trust more. To find the distance between my test points in those units of measurement, I simply need to multiply @arcUnitDistance by the radius values. Then it’s easy to calculate the skew from the expectation and print it out.
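A sketch of that test, using the radius and expected values above; with my approximate coordinates the calculated figures will be close to, though not exactly, the ones shown in the results:

DECLARE @earthRadiusKm FLOAT = 6373.0;
DECLARE @earthRadiusMi FLOAT = 3960.0;
DECLARE @expectedKm    FLOAT = 7527.806;   -- Google Earth estimate
DECLARE @expectedMi    FLOAT = 4677.562;   -- Google Earth estimate

SELECT  'Kilometers' AS Units,
        @expectedKm AS Expected,
        @arcUnitDistance * @earthRadiusKm AS Calculated,
        ABS(@expectedKm - @arcUnitDistance * @earthRadiusKm) / @expectedKm AS Skew
UNION ALL
SELECT  'Miles',
        @expectedMi,
        @arcUnitDistance * @earthRadiusMi,
        ABS(@expectedMi - @arcUnitDistance * @earthRadiusMi) / @expectedMi;

The test yields these results: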

Units        Expected    Calculated          Skew
Kilometers   7527.806    7521.74343697558    0.00080535590641083
Miles        4677.562    4673.79632989539    0.000805049746985879

It seems that, for these two test points at least, the skew relative to Google Earth is about 8/100ths of one percent. Over the years, I’ve successfully adapted this haversine code to C#, JavaScript and Java, too. The haversine formula performs better for short distances than a simple law-of-cosines formula, which suffers from floating-point rounding error when the two points are very close together. However, because it still treats the Earth as a perfect sphere, the haversine formula has a margin of error of up to 0.3% in some cases, which can be too great for applications that require high precision.

If you want much greater accuracy when calculating distances on spheroids like planet Earth, you should check out Vincenty’s formula instead. It’s more complex than the haversine formula but typically yields better results. I hope you find this version of the haversine formula written in Transact-SQL useful for a variety of distance-measurement applications.