Schema Refactoring with Views

Database triggers are generally pretty awful. Anyone who has had to deal with a database heavily laden with DML triggers knows this and avoids using them. If anyone tells me that they must use a trigger, I can always prove that there’s a better way that doesn’t bury dependencies and business logic so deeply in the database. However, there’s one special case where I find triggers helpful.

Refactoring a schema to support new work is seemingly impossible when you must also serve legacy applications that cannot be easily changed. Views provide an interesting abstraction that can give you the flexibility you need in this case. Acting like façades over the table structures evolving beneath them, views may use instead-of triggers to translate the data back and forth between the old and new formats, making it seem to the legacy applications that the database hasn’t changed. Let me start with a simple Address table to demonstrate the idea.
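
A minimal sketch will do; the exact shape of the table isn’t important, so treat [Street] and [City] here as placeholder columns alongside the [Latitude] and [Longitude] pair we’re about to refactor:

CREATE TABLE [dbo].[Address] (
    [AddressID]  INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    [Street]     NVARCHAR(100)     NOT NULL,
    [City]       NVARCHAR(50)      NOT NULL,
    [Latitude]   FLOAT             NOT NULL,
    [Longitude]  FLOAT             NOT NULL
);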

Now suppose that we need to make two changes to this table without disrupting legacy applications that depend on the Address table remaining defined as it has been for some time. First of all, to separate the subject areas of the database better, we’ve decided that we must move the Address table from the [dbo] schema into a schema named [geo]. Next, the individual [Latitude] and [Longitude] attributes must be replaced with a single column of type GEOGRAPHY to support some enhanced geographic processing functions being added for new applications. The new Address table will be created like this:
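
Sketched with the same placeholder columns, and with [Location] as my arbitrary name for the new GEOGRAPHY column, it looks something like this:

CREATE SCHEMA [geo];
GO

CREATE TABLE [geo].[Address] (
    [AddressID] INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    [Street]    NVARCHAR(100)     NOT NULL,
    [City]      NVARCHAR(50)      NOT NULL,
    [Location]  GEOGRAPHY         NOT NULL
);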

Moving the data into the new table is straightforward using the GEOGRAPHY::Point function:
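
Assuming we want to preserve the original key values, something along these lines does the job (4326 is the SRID for WGS 84):

SET IDENTITY_INSERT [geo].[Address] ON;

INSERT INTO [geo].[Address] ([AddressID], [Street], [City], [Location])
SELECT [AddressID], [Street], [City],
       GEOGRAPHY::Point([Latitude], [Longitude], 4326)
FROM [dbo].[Address];

SET IDENTITY_INSERT [geo].[Address] OFF;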

After fixing up any foreign key references from the old Address table to the new one, we’re ready to drop the old table and create a new view in its place:
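
Against the placeholder columns above, that looks something like this:

DROP TABLE [dbo].[Address];
GO

CREATE VIEW [dbo].[Address]
AS
SELECT [AddressID],
       [Street],
       [City],
       [Location].Lat  AS [Latitude],
       [Location].Long AS [Longitude]
FROM [geo].[Address];
GO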

The new view is named exactly like the dropped Address table so SELECT queries work just as they did before:
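
For example, a legacy-style query such as this one (the city value is purely illustrative) returns the same columns it always did:

SELECT [AddressID], [Street], [City], [Latitude], [Longitude]
FROM [dbo].[Address]
WHERE [City] = N'Richmond';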

Now let’s try to insert some data into the view:
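
Again with illustrative values:

INSERT INTO [dbo].[Address] ([Street], [City], [Latitude], [Longitude])
VALUES (N'123 Main Street', N'Richmond', 37.5407, -77.4360);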

The insert fails with the message:

Update or insert of view or function ‘dbo.Address’ failed because it contains a derived or constant field.

The problem is that the view is based on some logic for parsing out latitude and longitude parts from a GEOGRAPHY type. The view doesn’t know how to convert the individual latitude and longitude components back into a GEOGRAPHY so let’s provide an instead-of trigger to do that:
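
A sketch of such a trigger, reusing the placeholder columns from above (the trigger name is arbitrary), might look like this:

CREATE TRIGGER [dbo].[Address_IOI]
ON [dbo].[Address]
INSTEAD OF INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- Recombine the legacy columns into a GEOGRAPHY point.
    INSERT INTO [geo].[Address] ([Street], [City], [Location])
    SELECT [Street], [City],
           GEOGRAPHY::Point([Latitude], [Longitude], 4326)
    FROM inserted;
END;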

With the INSTEAD OF INSERT (IOI) trigger in place, the insert statement tried before now works. We should add INSTEAD OF UPDATE (IOU) and INSTEAD OF DELETE (IOD) triggers to the view to make sure those operations continue to work for legacy applications, too:
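
Sketches of those two triggers follow; as with the insert trigger, the names are arbitrary and the columns are placeholders:

CREATE TRIGGER [dbo].[Address_IOU]
ON [dbo].[Address]
INSTEAD OF UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    UPDATE a
    SET [Street]   = i.[Street],
        [City]     = i.[City],
        [Location] = GEOGRAPHY::Point(i.[Latitude], i.[Longitude], 4326)
    FROM [geo].[Address] AS a
    INNER JOIN inserted AS i ON a.[AddressID] = i.[AddressID];
END;
GO

CREATE TRIGGER [dbo].[Address_IOD]
ON [dbo].[Address]
INSTEAD OF DELETE
AS
BEGIN
    SET NOCOUNT ON;

    DELETE a
    FROM [geo].[Address] AS a
    INNER JOIN deleted AS d ON a.[AddressID] = d.[AddressID];
END;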

With those triggers in place, the following statements work as we had hoped:
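
With illustrative key and coordinate values:

UPDATE [dbo].[Address]
SET [Latitude] = 37.5538, [Longitude] = -77.4603
WHERE [AddressID] = 1;

DELETE FROM [dbo].[Address]
WHERE [AddressID] = 1;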

In closing, I’ll admit that this pattern has some potential problems that you may need to address. If you’re using an Object-Relational Mapping (ORM) tool that dynamically inspects and validates metadata in the database, it may get confused by the use of a view where it once found a real table. What the Microsoft Entity Framework (EF) refers to as Navigation Properties, representing the foreign key relationships between tables, may also break under this pattern. In addition, the UPDATE trigger as currently implemented offers no way to change the primary key value. That’s certainly possible using the [deleted] row set provided to the trigger, but since modifying surrogate primary keys isn’t commonly expected or allowed, I didn’t provide that more complex implementation. Lastly, the further your evolving database design drifts from what the legacy applications expect, the harder the views and their triggers will be to maintain. My Schema Refactoring Pattern, as I call it, is best employed when you have a firm date in hand by which the old schema can be deprecated. Triggers are still evil, so begin the refactoring process with a solid plan to stop using them as soon as possible.

Generating Filenames Dynamically in SSIS

A file’s name and location are often used to express what’s inside it. Filenames are not required to be meaningful to human beings but they often follow some sort of pattern for categorizing and describing the data inside them. In this way, we can think of the name of a file as being somewhat like metadata. In this article, I’ll focus on a simple example that follows this idea: generating a filename in SQL Server Integration Services (SSIS) that contains the time and date when the file was created. The file creation time is important metadata that other systems can use to make decisions in downstream ETL processes.

In SSIS, the Script Task is good for this sort of thing because a small bit of C# code can add just the flexibility and reuse we need. For example, imagine that a file with Employee information will be created from the HR schema. Those bits of metadata can easily be embedded in the filename because they’re essentially static with respect to the package. However, adding something dynamic like the current date and time to the filename requires some code. To satisfy the requirements in a reusable way, imagine a simple, string-based template that looks like this:

HR_Employees_{yyyyMMdd}T{HHmmss}Z.csv

The text between the curly braces is what we need to parse out to be replaced with the current date and time values. To find the escaped date and time sequences, a regular expression will do nicely. In Figure 1, observe the template string being evaluated in a popular, web-based regular expression tester.

Figure 1 – Using a web-based regular expression tool to find escaped date and time sequences.

Regular expressions are weird but with a bit of study and tools like the one shown in Figure 1, it’s easy to experiment and learn quickly. There are many such regular expression testing tools online as well as a few that can be run natively on your computer. The simple expression tested here is \{\w+\} which looks rather cryptic if you’re unaccustomed to regular expression syntax. However, it’s really quite simple. Reading left to right, the expression means we’re looking for:

  1. A starting curly brace followed by
  2. One or more word characters (letters, digits or underscores) followed by
  3. An ending curly brace

As you can see in the target string near the bottom of Figure 1, both of the sequences in the template have been found using this regular expression. That means the regular expression will work in the C# code. All that’s needed now is the code that will find those sequences in the template and replace them with their current date and time values.

Before we look at that, however, I must drag a new Script Task onto the control flow of my SSIS package. I also need to add two variables to the package that will be used to communicate with the script. Figure 2 shows the control flow with the new Script Task on the design surface, the two new variables that were added, and the opened configuration dialog for the Script Task.

Figure 2 – Configuring a new Script Task in the SSIS package.

After dragging a Script Task object from the toolbox onto the control flow, double-clicking it shows the Script Task Editor dialog. To support the invocation of the C# code, two package-level variables called FilenameTemplate and GeneratedFilename were created. You can see them in the variables window near the bottom of Figure 2. Notice that the FilenameTemplate variable has the text with the escaped date and time parts tested earlier. In the Script Task Editor, the FilenameTemplate variable has been added to the ReadOnlyVariables collection and the GeneratedFilename variable has been added to the ReadWriteVariables. That’s important. Failing to add the variables to those collections means they won’t be visible inside the C# code and exceptions will be thrown when trying to use them.

Now we’re ready to write some script code. Clicking the Edit Script button in the Script Task Editor dialog will start a new instance of Visual Studio with the standard scaffolding to support scripting in SSIS. Find the function called Main() and start working there. The first line of code must fetch the contents of the FilenameTemplate variable that was passed in. Here is the line of C# code to do that:

string template = Dts.Variables["FilenameTemplate"].Value.ToString();

With the template in hand, we can convert and save the escaped date and time sequences with the following line of code:

Dts.Variables["GeneratedFilename"].Value = ExpandTemplateDates(template);

Of course, to make that work, we need to implement the ExpandTemplateDates() function, so the following code should be added inside the same class where the Main() function is defined.

This method creates the \{\w+\} regular expression tested earlier and uses it to replace the matching sequences in the template parameter. That’s simple to do with .NET’s DateTime class which has a handy ToString() function that can accept the yyyyMMdd and HHmmss formatting strings found in the template. Figure 3 brings all the code together to help you understand.
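
A sketch of such a method might look like the following; I’ve assumed UTC (the template ends in a Z suffix) and the invariant culture, and it relies on using directives for System.Text.RegularExpressions and System.Globalization at the top of the script file:

private string ExpandTemplateDates(string template)
{
    DateTime now = DateTime.UtcNow;

    // Replace each {...} sequence with the current date or time, treating
    // the text inside the braces as a .NET date/time format string.
    return Regex.Replace(template, @"\{\w+\}", match =>
    {
        string format = match.Value.Trim('{', '}');
        return now.ToString(format, CultureInfo.InvariantCulture);
    });
}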

Figure 3 – The script code to find and replace escaped date and time formatting sequences.

Before closing the C# code editor, it’s a good idea to run the Build command from the menu to make sure there are no syntax errors. To use the new dynamic filename generator, I’ll add one more variable to the package called Filepath. That will be concatenated with the GeneratedFilename to form the full path on disk where the output file from the package will be stored. The connection manager for that file needs to have its ConnectionString property modified at runtime so I’ll use the Expression Builder to do that.

Figure 4 – Using the Expression Builder dialog to modify the target ConnectionString.

From the properties for the connection manager, click the ellipsis (…) button next to the Expressions property and add an expression for the ConnectionString as shown in Figure 4. Once that expression is saved, the full path and name of the file to be saved will be assembled from the Filepath and the GeneratedFilename variables at runtime.
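
The expression itself is a simple concatenation; assuming both variables live in the User namespace, it looks something like this:

@[User::Filepath] + @[User::GeneratedFilename]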

Bringing it all together, Figure 5 shows the results of running the package with a source table, the target flat file bearing the new ConnectionString expression and a Data Flow Task that moves some data from the source to the target. The data flow itself isn’t relevant so it isn’t shown here. What’s important to demonstrate is that the C# code correctly fetched the template variable, processed the regular expression, matched the sequences, replaced them with the date and time values and saved the new filename. The connection manager’s ConnectionString expression also correctly applied the newly generated filename to the path when saving the file to disk.

Figure 5 – A test run showing the dynamic filename that was generated and the file on disk with that name.

I marked up the screen shot with a red arrow pointing to the package log output showing the filename that was generated by the C# code when it ran. The blue arrow points to the actual target file on disk, showing that the two match.

There are other ways to do what’s been demonstrated here. However, I find this solution to be both simple and extensible. The example shown here can be easily modified to include many types of dynamic metadata other than dates and times. Moreover, this is a highly reusable pattern given that you need only copy the Script Task into a new SSIS package and set up a couple of package variables to use it anywhere you like. In the next article in this series, I’ll focus on consuming files with dynamically assigned filenames.

On Primary Key Names

If you use frameworks like Microsoft Azure Mobile Services or Ruby on Rails, then you’re accustomed to complying with a host of development conventions. Frameworks are often said to be opinionated, forcing certain design decisions on the developers who use them. Given that very few software design choices are perfect in every situation, the value of having an opinion is often more about consistency than it is about correctness.

“A foolish consistency is the hobgoblin of little minds, adored by little statesmen and philosophers and divines. With consistency a great soul has simply nothing to do.” — Ralph Waldo Emerson, Essay: First Series on Self-Reliance

Emerson’s famous quote about the potential perils of standardization is sometimes misapplied. For example, I once attended a seminar by a software vendor where the speaker referred to Ruby on Rails developers as hobgoblins because of their unswerving reliance on programming conventions. Yet those who understand the Ruby on Rails framework know that it is loaded with touchstones that lead us to exhibit good and helpful behaviors most of the time. The seminar speaker’s broad derision of the Rails framework was based either on a misunderstanding or on an intent to misdirect the hapless audience for some commercial gain. In his famous essay, Emerson condemns only the foolish consistencies, those friendly yet troublesome creatures of habit that lead to folly, as the ones to be categorically avoided.

“One man’s justice is another’s injustice; one man’s beauty another’s ugliness; one man’s wisdom another’s folly.” — Ralph Waldo Emerson

Yet, it is true now and again that the conventions expressed in Rails, Azure Mobile Services, Django, CakePHP and many other frameworks can lead to unfortunate consequences, whether through the tools’ misapplication or through circumstances that simply could not be foreseen by the frameworks’ designers. Nowhere is this more true than in the area of data access. A data pattern that the framework designer considers beautiful in many situations may repulse some database administrators in practice. Application frameworks are often developed and sold for so-called greenfield solutions, those not naturally constrained by prior design decisions in the database and elsewhere. However, many real-world implementations of application frameworks are of the brownfield variety, mired in the muck and the gnarled, organic growth of the thousands of messy technical decisions that came before. In environments like that, the framework designer’s attempt to enforce one kind of wisdom may prove to be quite foolish.

Primary keys are one of the concerns that application frameworks tend to be opinionated about. Ruby on Rails, Microsoft Azure Mobile Services and Django all name their surrogate primary keys id by default, meaning identifier or identity. The word identity traces back to the Latin idem and id, roughly meaning the same one or that one. So in these frameworks, when you use the primary key to identify a specific record, you’re sort of saying “I mean that one.”

Database developers and administrators often argue that relational databases aren’t so-called object databases and that naming primary keys the same for all tables leads to confusion and errors in scripting. It’s true that when you read the SQL code in a database that uses id for all the primary key names, it can be a bit confusing. Developers must typically use longer, more meaningful table aliases in their queries to make them understood. Ironically, when the application frameworks that desire uniformity in primary key names generate database queries dynamically, they often emit short table aliases or ones that have little or no relationship to the names of the tables they represent. Have you ever tried to analyze a complex query at runtime that has been written by an Object-Relational Mapping (ORM) tool? It can be positively maddening precisely because the table aliases typically bear no resemblance to the names of the tables they express.

Another problem with having all the primary keys named the same is that it sometimes inhibits other kinds of useful conventions. For example, MySQL supports a highly expressive feature in its JOIN syntax that allows you to write code like this:

SELECT * FROM `Order` INNER JOIN LineItem USING (OrderID);

In this case, because the Order table’s primary key is named the same as the LineItem table’s foreign key to orders, the USING clause makes it really simple to connect the two tables. One has to admit that’s a very natural-feeling expression. The aforementioned application frameworks’ fondness for naming all primary keys id makes this sort of practice impossible.
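
For contrast, here is a sketch of what the same join looks like when every primary key is simply named id (the column names are illustrative):

SELECT *
FROM `Order` AS o
INNER JOIN LineItem AS li ON li.order_id = o.id;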

Now that I’ve spent some time besmirching the popular application frameworks for the way they name primary keys, let me defend them a bit. As I said in the beginning, when it comes to frameworks, their opinions are oftentimes more about consistency than objective or even circumstantial correctness. For application developers working in languages like Ruby or C#, having all the primary keys named similarly gives the database a more object-oriented feel. When data models have members that are named consistently from one object to the next, it feels as though the backing database is somewhat object-oriented, with some sort of base table that has common elements in it. If such conventions are reliable, all sorts of time-saving and confusion-banishing practices can be established.

Having done a lot of data architecture work in my career and an equal amount of application development work, my opinion is that naming all database primary keys the same has more benefits than drawbacks across the ecosystem. My opinion is based on the belief that application developers tend to make more mistakes in their interpretations of data than database people do. I believe this is true because, as stewards of information, database developers and administrators live and breathe data as their core job function, while Ruby and C# developers use data as just one of many facets that they manage in building applications. Of course, this is the sort of argument where everyone is correct and no one is. So I’ll not try to claim that my opinion is authoritative. I’m interested in hearing your thoughts on the subject.

Invoking Stored Procedures from Azure Mobile Services

One of the questions I’m often asked is whether it’s possible to call SQL stored procedures from Azure Mobile Services. The answer is yes and it’s probably easier than you think. In case you don’t know, Azure Mobile Services is a way to very simply expose an HTTP service over data stored in an Azure SQL database. By default, Azure Mobile Services exposes the SQL tables directly as resources. So the HTTP methods GET, POST, PUT and DELETE will essentially be mapped to SQL operations on the underlying tables.

While this simple mapping mechanism is good for resource-oriented access to the tables, the logic to produce a usable Web API is often a bit more complex than that. Stored procedures can provide an interesting abstraction layer that allows us to use the efficiency of the SQL Server query engine to reduce round trips to and from Internet clients, for example. Stored procedures might also be used to hide normalization peculiarities from clients or to apply advanced parameter handling logic. Whatever the case may be, it’s helpful from time to time to be able to invoke stored procedures from the HTTP API that Azure Mobile Services provides.

Let’s start by assuming that an Azure Mobile Service exists with some data that we would like to expose via a stored procedure. For the purposes of this example, my service is called MobileWeatherAlert which contains a backing table in an Azure SQL Database named [MobileWeatherAlert_db]. It’s really helpful that Azure Mobile Services uses schema separation in the underlying database to manage all of its data. That schema separation allows us to expose many separate middle-tier services from one common database if needed. So, in my weather database, there’s a schema called [MobileWeatherAlert] corresponding to the name of the service that it supports. For the purposes of this example, that schema contains a table called [Observation] which is used to collect weather data by [City]. Figure 1 shows a very simple stored procedure called [GetObservationsForCity] that I’d like to be able to call from the service API.

Figure 1 – A simple stored procedure to fetch weather observations for a given city.
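
The procedure itself is simple; sketched here with an assumed parameter size and column list, it just filters the [Observation] table by the city passed in:

CREATE PROCEDURE [MobileWeatherAlert].[GetObservationsForCity]
    @city NVARCHAR(50)
AS
BEGIN
    SET NOCOUNT ON;

    SELECT *
    FROM [MobileWeatherAlert].[Observation]
    WHERE [City] = @city;
END;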

There are a number of places where this procedure might be invoked. For this example, I’ll implement a custom API in the mobile service called observation. Figure 2 shows the dialog in the Azure management console where the custom API will be created.

Figure 2 – Creating the custom observation API for the MobileWeatherAlert service.

For this simple example, I’ll only implement the HTTP GET method in the API to invoke the stored procedure. For simplicity of the example, I’ll open up access to everyone to avoid having to pass any sort of credentials. Now I can add a bit of JavaScript to the API to make the stored procedure call. Figure 3 demonstrates adding that JavaScript to the API via the Azure management console.

Figure 3 – The JavaScript to invoke the GetObservationsForCity stored procedure with a URL-sourced city parameter.
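
The full listing is shown in Figure 3 (the line numbers discussed below refer to that figure); in sketch form, the script looks something like this:

exports.get = function (request, response) {
    var mssql = request.service.mssql;
    mssql.query("EXEC GetObservationsForCity ?", [request.query.city], {
        success: function (results) {
            console.log(results);
            response.send(statusCodes.OK, results);
        }
    });
};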

Lines 1 through 9 in the script encompass the get function that will be invoked when the HTTP GET method is used to call the service. The parameters passed to the JavaScript function are the request and response objects. From the request object, line 2 shows how to obtain a reference to the mssql object which exposes a query function for making calls into the database. Line 3 demonstrates how to call the query function to execute the [GetObservationsForCity] stored procedure, passing a single parameter for the City by which to filter. It’s important to note here that the schema in which the stored procedure resides is not named in the EXEC call. This is counter-intuitive, in my opinion, and is likely to trip up novices as they experiment with this functionality. Since we are invoking the GET method for the MobileWeatherAlert service, there’s an implicit assumption used in the preparation of the SQL statement that objects will reside in a similarly-named database schema.

Notice also on Line 3 that the request object passed into the JavaScript function exposes a query property that conveniently contains an object named city, parsed directly from the URL. Figure 4 shows how that URL might be passed from PostMan, a really excellent Google Chrome plug-in that allows the invocation of nearly any sort of HTTP-oriented web service or API.

Figure 4 – Calling the new API via PostMan to get weather observations for the city of Richmond.

Finally, lines 4 through 6 of the JavaScript contain the success function, which processes the results of the SQL query, logs them and returns them to the caller with an HTTP 200 (OK) response. I’ve included a call to the console.log() function to show how easy it is to log just about anything when you’re debugging your JavaScript code in Azure Mobile Services. After invoking an API or custom resource method that logs something, check out the logs tab of the mobile service in the management console to see what got saved. Of course, you’ll want to do minimal logging in production but while you’re testing and debugging, the log is a valuable resource.

In studying the URL and its output in Figure 4, remember that the JavaScript for the observation API didn’t have to do any special parsing of the row set returned by SQL Server to produce this result. Simply returning that data from SQL Server caused the API to emit JavaScript Object Notation (JSON) which has arguably become the lingua franca of the Internet for expressing data.

In closing, I’ll share a couple of thoughts. If you’re interested in building a simple query interface on top of a mobile service, you don’t have to use stored procedures as shown here. Azure Mobile Services implements fairly rich OData support directly on table resources. With OData, filtering, sorting and pagination of SQL data are built in, so to speak. Also, the web way of doing services (sometimes called RESTful, based on Dr. Roy Fielding’s dissertation and the HTTP standards that flowed from it) assumes that we’ll use HTTP in the way it was intended: accessing and linking resources at a more basic level, using the HTTP methods GET, POST, PUT, and DELETE as a complete, fully-functional language for accessing those resources. Database people inherently understand and respect this access pattern better than many programmers working in traditional programming languages like C# and Java. After all, we’re accustomed to using four basic methods to manipulate data in our databases: SELECT, INSERT, UPDATE, and DELETE. Yet, as database people, we also know that giving software developers strict table-level access can cause all sorts of performance problems. For those situations where you know that some complex database operation could be performed much more efficiently with a bit of T-SQL code, a stored procedure or a view may be just the prescription your developers need. Hopefully, this article has helped you understand how to invoke programmatic resources in an Azure SQL database, and perhaps it will help you along the way to making the correct architectural choices in the design of your modern, data-driven web applications.

Earth Surface Distance in T-SQL

A few years ago, I was working on a project where I needed to calculate the distance between pairs of points on the Earth’s surface. With a bit of research, I found an implementation of the haversine formula written in Python on John D. Cook’s Standalone Numerical Code site. Given the latitude and longitude of two points, John’s code produces a unit coefficient that can be multiplied by the radius of the sphere to yield the distance in whatever unit of measurement you might want.

Having the Python code was great but I needed a version of the algorithm in F#. After searching for some time and not finding an F# implementation, I decided to write one based on John’s Python version and put it back into the public domain. John published my F# code into his Standalone Numerical Code library here for everyone to use freely.

The exercise for today is to write the haversine formula in Transact-SQL. I’ll start by proposing a test. Using two well-known points and Google Earth as the standard of comparison, I define the following variables.
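
The coordinates below are approximate city-center values for the two test points; substitute whatever reference points and expectations you prefer:

DECLARE @lat1 FLOAT = 37.5407,  @lon1 FLOAT = -77.4360;  -- Richmond, Virginia USA
DECLARE @lat2 FLOAT = -23.5505, @lon2 FLOAT = -46.6333;  -- São Paulo, Brazil

DECLARE @expectedMiles FLOAT = 4677.562;
DECLARE @expectedKm    FLOAT = 7527.806;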

Google Earth estimates that these two points, in Richmond, Virginia USA and São Paulo, Brazil, are roughly 4,678 miles or 7,528 kilometers apart. Having no better standard handy, those will have to suffice for some testing later on. Next comes the heart of the haversine formula.
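
In sketch form, using SQL Server’s RADIANS, SIN, COS, SQRT and ATN2 functions against the variables declared above, the calculation looks like this:

DECLARE @dLat FLOAT = RADIANS(@lat2 - @lat1);
DECLARE @dLon FLOAT = RADIANS(@lon2 - @lon1);

DECLARE @a FLOAT =
    POWER(SIN(@dLat / 2.0), 2) +
    COS(RADIANS(@lat1)) * COS(RADIANS(@lat2)) * POWER(SIN(@dLon / 2.0), 2);

-- Central angle in radians, i.e. the distance on a sphere of radius one.
DECLARE @arcUnitDistance FLOAT = 2.0 * ATN2(SQRT(@a), SQRT(1.0 - @a));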

I recommend saving this as a user-defined function in your SQL repository called something like ArcUnitDistance. The reason for that name is that the value this calculation produces is the distance between the two points supplied on a surface with a radius of one in whatever unit of measure you’re going to apply. Now it’s time to put the code to the test.

Commonly used values for the radius of the Earth when using the haversine formula are 3,960 miles or 6,373 kilometers. Of course, the Earth isn’t a sphere. It’s a spheroid, so you may find other radius values that you trust more. To find the distance between my test points in those units of measurement, I simply need to multiply @arcUnitDistance by the radius values. Then, it’s easy to calculate the skew from the expectation and print it out.
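
A test along these lines does the multiplication and computes the skew as a simple relative difference:

DECLARE @kilometers FLOAT = @arcUnitDistance * 6373.0;
DECLARE @miles      FLOAT = @arcUnitDistance * 3960.0;

SELECT 'Kilometers' AS Units, @expectedKm AS Expected, @kilometers AS Calculated,
       ABS(@expectedKm - @kilometers) / @expectedKm AS Skew
UNION ALL
SELECT 'Miles', @expectedMiles, @miles,
       ABS(@expectedMiles - @miles) / @expectedMiles;

The test yields these results: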

Units       Expected    Calculated         Skew
Kilometers  7527.806    7521.74343697558   0.00080535590641083
Miles       4677.562    4673.79632989539   0.000805049746985879

It seems that, for the two test points at least, the skew using Google Earth as the standard is about 8/100ths of one percent. Over the years, I’ve successfully adapted this haversine code to C#, JavaScript and Java, too. The haversine formula performs better for short distances than a simple law-of-cosines formula because, for nearly coincident points, the cosine of the central angle is so close to one that floating-point rounding error dominates the result. However, because it treats the Earth as a perfect sphere, haversine carries a margin of error of up to about 0.3% in some cases, which can be too great for applications that require high precision.

If you want much greater accuracy when calculating distances on spheroids like planet Earth, you should check out the Vincenty formula instead. It is more complex than haversine, iterating over an ellipsoidal model of the Earth, but it typically yields much better results. I hope you find this version of the haversine formula written in Transact-SQL useful for a variety of distance measurement applications.