How I monitor for errors in SQL Server's ERRORLOG

Like any other complex system, things can go wrong in the night (or any other time of day) in SQLServer, and like any other complex system, there's a log of these things that happen. Windows has the Event Viewer, IIS as the W3C logs, and SQLServer has the ERRORLOG text file.

Typically located in \MSSQL\Log, the log file is a record of all "important" activity that has occurred since the server was last started. The term "important" is relative, and generally means "things Microsoft felt should be recorded somewhere." This includes backups, restores, IO errors, CLR loads, etc. In addition to the default things Microsoft puts there, you can tweak SQL Server to log additional rows to the log file using various DBCC trace flags. For example, turning on DBCC traceflags 3605 and 1222 will tell SQL Server that it should log detail information about deadlocks to the error log file -- In my opinion a good idea to always have this flag turned on.

Now, a wise DBA will review this error log file for abnormal activity on a daily basis, but there are two problems. First, the errorlog file can get HUGE! After all, it's a log of all activity (well, important activity) since the server started. And second, there's tons of normal activity that occurs that we know about and don't care about, but it gets logged. Finding the stuff we care about as DBAs among all the other "important" stuff is difficult at best. I've never found the secret DBCC flag called "Only Log the Stuff I Care About". So what do we do?

First, there's a system stored proc called sp_cycle_errorlog -- this does exactly what it says, it tells SQLServer to close the current errorlog and start a new one -- without having to restart SQL Server! The "current" error log is always called ERRORLOG, and the one before that is ERRORLOG.1, the one before that is ERRORLOG.2, etc. To be clear, there are lots of renames going on on a cycle. If you have 10 errorlogs then errorlog.10 is renamed to errorlog.11, errorlog.9 is renamed to errorlog.10, errorlog.8 is renamed to errorlog.9 .... errorlog.1 is renamed errorlog.2 and finally errorlog (the active log) is renamed errorlog.1 and a new log is started.

I have a job in SQL Server that runs at 6AM (why 6AM? Because it seems like a good time!) that has two steps. The first step is simply: Then I have a second step, and this is my version of the missing "Only Log the Stuff I Care About" flag. Since there are all sorts of things that can go wrong in SQL Server I didn't feel comfortable searching for specific things to show. But I do know what I don't want to see! Since the errorlog file is just a text file, I can use grep to rip through the file and pull out what I don't want. Yes, I said what I don't want! There's a seldom used (well, I seldom use it) option in grep called --invert. It reverses the functioning of grep. Grep will typically output all matching lines, with --invert is will output all non-matching lines. This way I can run grep on the errorlog and pull out all the stuff I don't want.

There are two other parts in my error monitoring process. The first is to email myself the results, and to do that I use blat. The second is to account for the fact that the errorlog is in unicode, and grep (at least the version I have) doesn't read unicode. The unicode problem is actually handled by the builtin "type" command. Seems that "typing" a file converts it from unicode to ascii, which is what grep wants.

The command that processes my error log is then: It's actually all on one line in the .cmd I run from w/in the SQL Agent job, but this makes it simpler to see in the blog posting. The first "line" does the conversion from unicode. I know the file name is always errorlog.1 as I run this right after the sp_cycle_errorlog procedure.

The next line runs the tail command to give me everything excluding the first 7 lines (starting in line 8) since the first 7 lines are boilerplate that I don't care about (copyright, version, etc.)

The next two lines pull out the log and database backups -- with hundreds of DBs, and logfile backups every 15 mintues plus nightly full backups this step pulls out lots of unwanted crap.

The next four lines deal with SQLCLR objects we're loading. We know they're getting unloaded due to memory pressure (something we're working on) so we just ignore the load and unload messages. Note however that I'm specifically looking for the load of the assembly "sqlserverobjects", any other load would show up.

The final line emails me the results. And a resulting email is shown below. And as you can see, I have an IO error around 17:45 (line 5). Never would have seen that amongst 10,000 other lines of unimportant important stuff! This is a simple process to code up, and then runs on it's own every day. A great way to see the "truly important" stuff happening in SQL Server.

No comments:

Post a Comment