Computing MD5 Hash of varbinary(max) in SQL Server CLR

SQL Server provides the HASHBYTES function which computes the MD5 hash value of the bytes passed to it with one problem... it only accepts the older (pre-2005) max length of 8000 bytes in a varbinary. That's useful, I suppose, if you're trying to get the MD5 hash of a password, or a name, but completely useless in a case like I had where we needed to compute the MD5 hash of a blob column -- and a blob colum where the max datalength is dozens if not hundreds of megabytes. I needed something different!

Our first pass was to write a command line application that selected the filedata column (the blob) from the table and then just computed the MD5 from w/in c#. That made some sense since the point of this was to check an export we were doing of the file data from blobs to files in the file system but it just seemed wrong somehow. If only HASHBYTES wasn't broken! Then it dawned on us, we could write a SQL CLR function that computed the hash value from w/in SQL Server. This made infinitely more sense.

If you haven't played with them, SQL CLR functions allow you to code a static method in your .Net language of choice (C#!) and then install it into SQL Server as if it was a function you'd written in TSQL. You can use it like any other function in a TSQL statement. SQL Server actually acts as a CLR host, starting the appropriate app domains and managing memory consumption (we'll touch on that later). Your .Net code is actually running IN the SQL Server process! If you're wondering what Microsoft's commitment to SQL CLR is, the new geographic data types in 2008 are implemented using SQL CLR functions! Even though HASHBYTES is broken, we could write our own hash method that did work on varbinary(max) values and still compute the hash value from w/in SQL Server.

My immediate requirement was for a MD5 hash of a varbinary value, but I wanted to make the api slightly more generic so I came up with a method signature that took the varbinary we wanted to hash and the type of hash to compute. This way the same method could handle MD5, SHA1, etc.

My first pass looked like the code below. And it worked great in testing. I passed in the blob to hash, and a string to tell me the type. For example: select dbo.ComputeHash(BlobColumn, 'md5') from TableName and it worked just fine (in test!). I installed it in production and let her rip!

As proof of why you should test corner cases two things happened in production. First, what happens when the code above attempts to compute the MD5 hash of a null? That is, the passed in SqlBinary value is null because the underlying column is null? Well, it goes boom! That's what happens. This isn't a big deal, as the fix is simple, if the Source value is null then return a null. I wasn't sure what the "right" value to return was, but I figure the hash of a null is different from the hash of an empty string, so returning null made the most sense. Problem fixed... rerun...

OK, now there's a new problem. The code ran fine for the first 5000 or so blobs it handled, and then boom! We got around to a blob that was 32MB in size and that was just too darn big. SQL Server killed the CLR app domain because it was consuming too much memory! Remember, the CLR integration is running in-process in SQL Server, and SQL Server's primary concern is to protect itself. When an app domain gets too big SQL Server kills it. Specifically you get:

Msg 6532, Level 16, State 49, Line 1
.NET Framework execution was aborted by escalation policy because of out of memory.

It took some Googling but I came across Chris Bernard's blog post Out of memory exception while attempting to do sql clr and that shed some light on the issue, thanks Chris!

However, I didn't like the idea of messing w/the SQL Server settings, especially since this was a one-time function we were running. So I needed a different solution. I knew the problem was that we were passing the blob in as one giant memory allocation. SQL Server probably buffered it up in its memory, then passed a copy to the CLR which probably also had a copy or two hanging around. This is all fine when you're passing 500 bytes, but with a 32MB blob it adds up fast! If only we could get raw access to the SQL data...

I learned long ago that if you find yourself saying "Gee, it would have made sense if Microsoft did 'x'" then they probably already do 'x', you just need to find out where/how. And this was no different. In addition to the SqlBinary class, there's also a SQLBytes class. And guess what, the SQLBytes class provides a property on it called Stream that is access to the underlying stream of bytes. Since the ComputeHash function takes a stream just as readily as it takes a byte[] we could simply pass the stream in and everything worked.

I kept an eye on the memory usage of the CLR with this new code and it never exceeded 75MB, a much better solution!

Next time we'll look at how to install the CLR function in SQL Server and we'll see why I needed the command line applications to create a hex representation of a file and to popupate the clipboard from the command line.

6 comments:

  1. Hi guys,
    I know this is not related to sql server, but for some of you looking for an online encoder, check this online md5 converter
    Pretty Cool!
    David

    ReplyDelete
  2. This comment has been removed by the author.

    ReplyDelete
  3. is your C# solution available somewhere?

    ReplyDelete
  4. Exactly what I needed...
    Well done, mate !

    ReplyDelete
  5. To help my fellow googlers - here is another angle -:
    Select master.sys.fn_repl_hash_binary(cast('Some String more than 8000 bytes long' as varbinary(max)))

    ReplyDelete
  6. Nice article. I wrote a similar article about this last year, benchmarked it against the built-in HASHBYTES() function, and discussed a hybrid function depending on the input length.
    http://sandersdenardi.com/hashing-with-sql-server-clr

    ReplyDelete