Computing MD5 Hash of varbinary(max) in SQL Server CLR

SQL Server provides the HASHBYTES function which computes the MD5 hash value of the bytes passed to it with one problem... it only accepts the older (pre-2005) max length of 8000 bytes in a varbinary. That's useful, I suppose, if you're trying to get the MD5 hash of a password, or a name, but completely useless in a case like I had where we needed to compute the MD5 hash of a blob column -- and a blob colum where the max datalength is dozens if not hundreds of megabytes. I needed something different!

Our first pass was to write a command line application that selected the filedata column (the blob) from the table and then just computed the MD5 from w/in c#. That made some sense since the point of this was to check an export we were doing of the file data from blobs to files in the file system but it just seemed wrong somehow. If only HASHBYTES wasn't broken! Then it dawned on us, we could write a SQL CLR function that computed the hash value from w/in SQL Server. This made infinitely more sense.

If you haven't played with them, SQL CLR functions allow you to code a static method in your .Net language of choice (C#!) and then install it into SQL Server as if it was a function you'd written in TSQL. You can use it like any other function in a TSQL statement. SQL Server actually acts as a CLR host, starting the appropriate app domains and managing memory consumption (we'll touch on that later). Your .Net code is actually running IN the SQL Server process! If you're wondering what Microsoft's commitment to SQL CLR is, the new geographic data types in 2008 are implemented using SQL CLR functions! Even though HASHBYTES is broken, we could write our own hash method that did work on varbinary(max) values and still compute the hash value from w/in SQL Server.

My immediate requirement was for a MD5 hash of a varbinary value, but I wanted to make the api slightly more generic so I came up with a method signature that took the varbinary we wanted to hash and the type of hash to compute. This way the same method could handle MD5, SHA1, etc.

My first pass looked like the code below. And it worked great in testing. I passed in the blob to hash, and a string to tell me the type. For example: select dbo.ComputeHash(BlobColumn, 'md5') from TableName and it worked just fine (in test!). I installed it in production and let her rip!

As proof of why you should test corner cases two things happened in production. First, what happens when the code above attempts to compute the MD5 hash of a null? That is, the passed in SqlBinary value is null because the underlying column is null? Well, it goes boom! That's what happens. This isn't a big deal, as the fix is simple, if the Source value is null then return a null. I wasn't sure what the "right" value to return was, but I figure the hash of a null is different from the hash of an empty string, so returning null made the most sense. Problem fixed... rerun...

OK, now there's a new problem. The code ran fine for the first 5000 or so blobs it handled, and then boom! We got around to a blob that was 32MB in size and that was just too darn big. SQL Server killed the CLR app domain because it was consuming too much memory! Remember, the CLR integration is running in-process in SQL Server, and SQL Server's primary concern is to protect itself. When an app domain gets too big SQL Server kills it. Specifically you get:

Msg 6532, Level 16, State 49, Line 1
.NET Framework execution was aborted by escalation policy because of out of memory.

It took some Googling but I came across Chris Bernard's blog post Out of memory exception while attempting to do sql clr and that shed some light on the issue, thanks Chris!

However, I didn't like the idea of messing w/the SQL Server settings, especially since this was a one-time function we were running. So I needed a different solution. I knew the problem was that we were passing the blob in as one giant memory allocation. SQL Server probably buffered it up in its memory, then passed a copy to the CLR which probably also had a copy or two hanging around. This is all fine when you're passing 500 bytes, but with a 32MB blob it adds up fast! If only we could get raw access to the SQL data...

I learned long ago that if you find yourself saying "Gee, it would have made sense if Microsoft did 'x'" then they probably already do 'x', you just need to find out where/how. And this was no different. In addition to the SqlBinary class, there's also a SQLBytes class. And guess what, the SQLBytes class provides a property on it called Stream that is access to the underlying stream of bytes. Since the ComputeHash function takes a stream just as readily as it takes a byte[] we could simply pass the stream in and everything worked.

I kept an eye on the memory usage of the CLR with this new code and it never exceeded 75MB, a much better solution!

Next time we'll look at how to install the CLR function in SQL Server and we'll see why I needed the command line applications to create a hex representation of a file and to popupate the clipboard from the command line.

Command Line Hex Representation of File or Standard Input

In my last post (Populating Clipboard from Console App) I showed a *simple* command line app to copy something to the clipboard. The "something" I wanted to copy to the clipboard was the output of this simple app -- a command-line application to take a file (or any standard input data) and return the hex representation of the input. That is, if you have a file with "Walden" then this app will return "57616C64656E"

Why in the world would I do this? Well there are several reasons, including looking to see the actual byte values in a file if you're checking for unicode/code-page issues, or checking to see if there are hidden null values at the end of a file, but the main reason I wanted it was to give me the hex representation of a CLR DLL so I could create a CLR assembly in SQL Server. You can include the path to the DLL in the CREATE ASSEMBLY statement, or you can include the actual DLL bits. I prefer to include the bits, but that's another post.

I created the AsHex command line application to do this for me. It simply takes what's passed in as standard input and writes its hex representation to standard output -- very unix-like, hard to believe for a Windows bigot like me, I know.

Want to get the hex representation of some text? You can 'echo' it into AsHex like this:

echo Test Text | ashex

Do that and you'll get this as output:


Note that echo honors everything up to the pipe character, including the trailing spaces (0x20), and echo adds a CRLF (0x0d0a) to the end of your input.

Now being a standard command line tool you can also redirect standard input with the < sign. So to accomplish what I wanted (the hex rep of a dll) I can type:

ashex < SqlHashFunctions.dll

and I'll get something that looks like this (all on one line):


The code for this is straightforward:

We simply open the standard input stream and read the bytes one by one and write the hex representation ({0:X2}) to the console.

And if I want the output on the clipboard then I can combine this with the clipboard application and do this:

ashex < SqlHashFunctions.dll | clipboard

and I'll have the hex representation of the dll on the clipboard, ready to be pasted into my SQL script.

Populating Clipboard from Console App

As part of creating a SQL Server CLR function (other posts to come) I needed to copy the output of a command line app to the clipboard. Now I could of course simply redirect the text to a file and then open the file in Notepad++ and do a Ctrl-A Ctrl-C and go, but that offended my command-line sensibilities. So instead I wrote a short console application that takes standard in and copies it to the clipboard.

To say the code is trivial would be an understatement, it's a one-line program. However I did discover that if I didn't set the main function to be STA threaded I couldn't access the clipboard. This makes sense to me since the clipboard access is via COM and COM doesn't like non-STA things.

Like I said, simple code. We set the clipboard data to a text format copy of everything in the stdin stream.

Usage is simple too, just pipe what you want on the clipboard into the program. For example, want a directory listing of c:\ in the clipboard, do this:

dir c:\ | clipboard

Or as another post will show, want the hex representation of a DLL on the clipboard? Do this

AsHex < TheFile.dll | clipboard

You will need add a reference to PresentationCore to get access to the System.Windows namespace. And remember, just because it's a PresentationCore function doesn't mean you need to be in WPF or WinForms to use it.