Grant Fritchey’s SQL Server Howlers

We decided to ask various well-known SQL Server people to write about their favourite SQL Server Howlers. These are the common misunderstandings about how SQL Server works that end in tears, and plaintive forum questions. Grant Fritchey opens the series with some of his favourite howlers.

SQL Server is a pretty amazing piece of software, almost any way you look at it. Further, it comes with an amazing set of documentation in the Books Online (BOL). I’m going to say the following, despite the fact that I publish books and articles for a living: in most instances and in most environments, you really don’t need any other documentation to run SQL Server reasonably well. Yeah, there are edge cases and extremes where the BOL can be inadequate, but by and large, they really do a decent job of explaining things. And then there are the howlers. These are the features that… well, really, they’re just not quite documented correctly or thoroughly. How do I know this? Because the same questions on the same topics keep coming up in various forums and out on Twitter.

I ran DBCC CHECKDB and now my database is missing a lot of data. What do I do now?

This will be common to most of the howlers, but the truth is, you never get the complete story in the first question. This person did a consistency check of their database, and nothing else, and the database is now missing data? Possible, but highly unlikely. No, with investigation (read: additional questions), it’s determined that they ran CHECKDB twice: once, which discovered errors, and a second time, to repair the errors found. Wait, did you say “repair”? “Yeah, we ran CHECKDB with REPAIR_ALLOW_DATA_LOSS because the business said they didn’t mind if they lost a row or two of the database.”
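
For reference, the sequence usually looks something like this (the database name is hypothetical); note that the repair options require the database to be in single-user mode, which should itself be a hint that something serious is about to happen:

-- First pass: find the corruption (this part is harmless)
DBCC CHECKDB('MyDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;

-- Second pass: the howler. This can throw away entire pages of data.
ALTER DATABASE MyDatabase SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
DBCC CHECKDB('MyDatabase', REPAIR_ALLOW_DATA_LOSS);
ALTER DATABASE MyDatabase SET MULTI_USER;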

OK, I know that command actually says ALLOW_DATA_LOSS, and I know that the BOL says “These repairs can cause some data loss.” But you know what? It’s that word “some” that’s killing people out there. They equate it with “a few” or “very little” or “go ahead, what the heck could go wrong, you’re only going to lose SOME data.” Oh, right.

When you tell DBCC to perform the repairs and allow for data loss, it’s going to link the pages back together any way it can. At a minimum, you’re likely to lose a page, which is 8KB worth of data. Do the math on your own tables to figure out exactly how many rows that represents, but it’s certainly more than “a few.” And that’s the best-case scenario. Worst case, it has trouble linking the pages of the table back together at all and you lose the whole thing.

What do you do instead? First, you identify the object that has the corrupted page. If it’s a non-clustered index, you can just drop it and recreate it (or run one of the REPAIR options other than REPAIR_ALLOW_DATA_LOSS; they do the same thing). If it’s some other allocation error, there are other approaches, outlined nicely in this article by Gail Shaw. If it’s a heap table or a clustered index, then, yes, you’ve lost data. Time to go to your backup. What’s that? You don’t have good backups, or, even worse, the backups have never been tested in any way and the corruption exists there as well… That’s a different problem entirely.
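
A sketch of that first step, against a hypothetical database: run CHECKDB with no repair option at all, then map the object and index IDs reported in the error messages back to names so you know what you’re actually dealing with (the ID values below are made up).

-- Find the corruption without repairing anything
DBCC CHECKDB('MyDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;

-- Map the object ID and index ID from the error messages to names
SELECT OBJECT_NAME(object_id) AS TableName,
       name AS IndexName,
       type_desc
FROM   sys.indexes
WHERE  object_id = 526624919
       AND index_id = 2;

-- index_id greater than 1 means a non-clustered index: drop and recreate it.
-- index_id 0 (heap) or 1 (clustered index) means the data itself is damaged:
-- it's time for that backup.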

For even more advice on how to fix corruption issues, I usually go to the man who wrote DBCC, Paul Randal. This is a good starting point.

There is a warning in the BOL: “Use the REPAIR options only as a last resort.” But nothing else. If you don’t know what you’re doing and part of your database is unavailable due to a consistency error, you’ve gone to the one place you know to get help, and it offers you these nice little statements that say REPAIR. Well, I’m at my wit’s end and I want to fix this, so this is my last resort.

The BOL needs to tell the truth: “These repairs can go through your database like the Mongol horde through a small village.” Or, “These repairs can cause some data loss, where the upper end of ‘some’ is roughly equal to the Greek national debt.” Then, maybe, people would be a little more leery of immediately running these statements.

I’ve implemented everything the Database Tuning Advisor told me to and the server is still slow. How can I speed this up?

The Database Tuning Advisor (DTA) is an attempt by Microsoft to put in place a mechanism to help people who just don’t have the time to learn all the ins and outs of performance tuning. Accidental DBAs all over the planet need a way to quickly tune their databases, but they don’t have all the knowledge they need. Here’s where the DTA enters the picture. There are just a few problems with it.

First, the degree to which the DTA can help is completely limited by the quality of the input. People will frequently take a single query and run it through the DTA, assuming that the values and parameters in that query are representative of their system as a whole, or just because that’s the one that seems to be running slow. Or they’ll take a small set of trace data generated through a server-side trace (or through the Profiler GUI, more on that later), again assuming that this is representative of their system. The problem is, you need to feed the DTA a lot of very good data in order for its recommendations to be worth much.

Second, people will connect the DTA to a development database instead of one that closely mirrors their production environment. So there will be a lot less data, with radically different statistics, which causes the DTA to make choices based on bad information. If you want the DTA to figure out why a query or queries are having problems, you need to closely match the production environment in order to supply it with the best information you have.

Finally, people assume that the DTA is right. It isn’t always. As a matter of fact, it’s frequently wrong. You cannot simply put faith in the fact that the DTA is going to come up with a set of recommendations that are 100% accurate. You must take everything, everything, it recommends and test it. Validate those recommendations against your own systems in order to be sure they’re accurate.

The problem here is, if you open a support call with Microsoft, they will frequently require you to do two things before they will begin troubleshooting your problem: run the query, or queries, through the DTA, and implement everything it says. That’s just flat-out wrong of Microsoft Support.

Alternatives? First and foremost, test everything it suggests to you. Second, learn about query tuning and fix the problem queries yourself, or hire someone who does that.
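
A minimal version of that testing, with a hypothetical table, query and suggested index: capture the reads and CPU before and after, and be ready to throw the index away.

SET STATISTICS IO, TIME ON;

-- Baseline run of the problem query
SELECT SalesOrderID, OrderDate
FROM   dbo.SalesOrders
WHERE  CustomerID = 42;

-- Apply the suggestion
CREATE INDEX IX_SalesOrders_CustomerID
ON dbo.SalesOrders (CustomerID)
INCLUDE (OrderDate);

-- Re-run the query and compare logical reads and CPU time from the Messages tab.
-- Then check your INSERT, UPDATE and DELETE statements against the same table.
-- If the index doesn't pay for itself, remove it:
DROP INDEX IX_SalesOrders_CustomerID ON dbo.SalesOrders;

SET STATISTICS IO, TIME OFF;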

What does the BOL say? Anything about testing? Not a word. This is as close as I can find to a caution, “Your specific environment and implementation of SQL Server determine the exact issues you must consider when using Database Engine Tuning Advisor.” And that’s it. Nothing like, “Adding multiple manually generated sets of statistics onto one table, maybe, needs some testing to validate that it’ll actually do anything to help your performance” or “Just because we suggested that you add this index with four, rather large INCLUDE columns to help your SELECT query, doesn’t mean we knew about your INSERT query that’s going to be negatively impacted.”

Ever since we implemented the NOLOCK hint on all our queries, things are faster, but some of the users are complaining about the quality of the data. Why is that?

In certain areas, on certain discussion forums, on certain blogs, the NOLOCK hint has become equivalent to a RUN_FASTER hint. It just isn’t true. The issue with NOLOCK is that it allows dirty reads. The example that comes to everyone’s mind is, “I’ve started a transaction to change the value on this column from ‘dog’ to ‘cat’ and because my transaction hasn’t committed, people can see the value ‘dog’ even though I’m changing it to ‘cat’ and I don’t care.” And you know what, if that’s all there was to it, you’d be right. You wouldn’t need to care. But, as I’m sure you’ve already guessed if you’ve read this far, that’s not all there is to it.

The reason NOLOCK appears to make queries run faster is that it won’t take out shared locks, which means that when modifications to a page are occurring, your query just keeps going, reading what it can, avoiding contention, reducing memory consumption just a tiny bit, and running faster. All good, right? Wrong. Remember, you’re reading from a page, not from a row. Ever heard of page splits? That’s when data that fit on a single page is moved onto two pages because a modification would otherwise have over-filled the page, some of the rows are rearranged, and so on. Now, think about the fact that we’re not putting a shared lock on that page. Think about the fact that you’re trying to read rows from a page that suddenly splits, you don’t have locks on it, and the data disappears out from under you. Now your query returns fewer rows than it ought to. Or, you’re reading from multiple pages as page splits occur and suddenly you’ve read the same rows more than once (again, you’re not locking them, so if they move ahead of you, you’ll never know), and now you have duplicate data. I don’t know about your business, but most businesses are going to get upset when, every so often, completely randomly, they lose rows or get duplicate rows. What’s that? You don’t care? Cool, then no complaining when it’s your paycheck that misses payroll because it was lost to your NOLOCK hint. Oh, and one more: if a row gets deleted just as you’re reading it, you get an error. So a benign query that does nothing wrong, other than having NOLOCK, just dies for no good reason.
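
You can see the dirty-read part of this for yourself with nothing more than two query windows (the table and values are hypothetical):

-- Session 1: start a change, but don't commit it
BEGIN TRANSACTION;
UPDATE dbo.Pets SET PetType = 'cat' WHERE PetID = 1;

-- Session 2: NOLOCK happily reads the uncommitted 'cat'
SELECT PetType FROM dbo.Pets WITH (NOLOCK) WHERE PetID = 1;

-- Session 1: change your mind
ROLLBACK TRANSACTION;

-- Session 2 has now reported a value that, as far as committed data is concerned,
-- never existed. The missing and duplicated rows caused by page splits are harder
-- to reproduce on demand, but they are just as real.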

The BOL gets this one exactly right:

Allowing dirty reads can cause higher concurrency, but at the cost of reading data modifications that then are rolled back by other transactions. This may generate errors for your transaction, present users with data that was never committed, or cause users to see records twice (or not at all).

But for some reason, people just don’t listen or don’t read. Maybe a flashing neon sign about records duplicating or being missing is required.

What do you do instead? First, there are edge cases where NOLOCK works. But, those exceptional situations should be just that, exceptional. I’ve seen and heard about places where NOLOCK is put on every query, with people even trying desperately to put it on INSERT/UPDATE/DELETE queries (guys, you can’t modify data without locking, sorry). Don’t use it everywhere. You’re just begging for problems. Most of your data doesn’t update? Great, move it to a different filegroup and make it read only. Then you won’t get any locking either. Can’t do that but still dealing with contention? Then you ought to look into READ_COMMITTED_SNAPSHOT for your isolation level and take advantage of what versioned rows can do for you.
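
Both of those alternatives are one-line changes, sketched here against a hypothetical database and filegroup (and remember that row versioning puts extra load on tempdb, so test it):

-- Data that rarely changes: put it on its own filegroup and mark it read only
ALTER DATABASE MyDatabase MODIFY FILEGROUP ArchiveFG READ_ONLY;

-- Still fighting reader/writer contention? Use versioned rows instead of dirty reads
ALTER DATABASE MyDatabase SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;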

I know that some of you just won’t believe that your magic “run faster” switch is dangerous, so here’s some further reading on the topic.

What are the limits on the number of indexes you can put on a table? I’ve been implementing all the indexes suggested in the Missing Index DMOs and I’m concerned I might hit that limit.

Really? Just out of curiosity, pick any one of the missing indexes indicated in the DMOs and tell me which query it’s for. Take your time. I’ll wait… That’s right. There is absolutely no way to know which query you just “tuned” by adding an index.
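
If you don’t believe me, query the missing index DMOs yourself; this is a fairly typical way to read them, and notice what’s missing from the output:

SELECT mid.statement AS TableName,
       mid.equality_columns,
       mid.inequality_columns,
       mid.included_columns,
       migs.user_seeks,
       migs.avg_user_impact
FROM   sys.dm_db_missing_index_details AS mid
JOIN   sys.dm_db_missing_index_groups AS mig
       ON mig.index_handle = mid.index_handle
JOIN   sys.dm_db_missing_index_group_stats AS migs
       ON migs.group_handle = mig.index_group_handle
ORDER BY migs.avg_user_impact DESC;

-- Tables, columns and usage counters, yes. The query, or queries, behind each
-- suggestion? Not there.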

Let’s talk about statistics for a moment. It’s the statistics on your tables and indexes that drive the query optimizer into making certain choices. Statistics are just a sampling of the data in your database. They can be created by doing a full scan of the data, which is then distilled into up to 200 little buckets that describe your multi-million-row table, or they can be a more or less random sampling of that same table, split into the same 200 little buckets. In other words, the statistics can be estimates. Further, statistics are only updated automatically when a sufficient quantity of data changes.
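
You can see exactly how good, or bad, that sampling is for any given statistic (the table and index names here are hypothetical); the header shows the number of rows sampled against the total, and the histogram shows those 200-odd buckets:

DBCC SHOW_STATISTICS ('dbo.SalesOrders', 'IX_SalesOrders_CustomerID') WITH STAT_HEADER;
DBCC SHOW_STATISTICS ('dbo.SalesOrders', 'IX_SalesOrders_CustomerID') WITH HISTOGRAM;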

Why do you care about statistics? You just want to create some indexes, right? Here’s the deal: because the statistics can be out of date or can be inaccurate estimates, the resulting index suggestions in the missing index DMOs can be flat-out wrong.

Further, as I said at the beginning, show me which queries you’re building indexes for. It matters. The missing index information could be for a query that gets run once a month, takes about 90 seconds to complete and doesn’t bother anyone, while adding the suggested index could negatively impact other queries that are called all day long.

So what am I saying? I’m saying that simply slapping an index on the table doesn’t make sense. You need to know what you are doing to the system and why you are doing it. Rather than simply take a chance, you should investigate and test the indexes suggested. Testing is fairly simple: add a single index at a time and, if possible, in an offline system separate from your production environment, test the impact of that index on your queries. Investigation is slightly more work.

First, you need to check the suggested index to see if it duplicates the functionality of existing indexes, or if an existing index can be modified instead. I’ve seen the missing indexes suggest an index on ColA with an INCLUDE on ColB, immediately followed by a suggestion for an index on ColA with an INCLUDE on ColB and ColC. The fact is, we don’t need both of those indexes. Assuming we need them at all (see testing above), the second suggestion will satisfy the first. Second, you should look into using XQuery against the missing index information in execution plans to connect a missing index to a query. I have a blog post here that creates the foundation for this type of querying. You can also read some of the work by Jonathan Kehayias.
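
A quick way to check for that kind of overlap before you add anything is to list the key and included columns of the indexes that already exist on the table (a sketch, against a hypothetical table name):

SELECT i.name AS IndexName,
       STUFF((SELECT ', ' + c.name
              FROM sys.index_columns AS ic
              JOIN sys.columns AS c
                   ON c.object_id = ic.object_id
                   AND c.column_id = ic.column_id
              WHERE ic.object_id = i.object_id
                    AND ic.index_id = i.index_id
                    AND ic.is_included_column = 0
              ORDER BY ic.key_ordinal
              FOR XML PATH('')), 1, 2, '') AS KeyColumns,
       STUFF((SELECT ', ' + c.name
              FROM sys.index_columns AS ic
              JOIN sys.columns AS c
                   ON c.object_id = ic.object_id
                   AND c.column_id = ic.column_id
              WHERE ic.object_id = i.object_id
                    AND ic.index_id = i.index_id
                    AND ic.is_included_column = 1
              FOR XML PATH('')), 1, 2, '') AS IncludedColumns
FROM   sys.indexes AS i
WHERE  i.object_id = OBJECT_ID('dbo.SalesOrders')
       AND i.index_id > 0;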

We keep rebuilding our indexes and updating our statistics, but the queries are still running slow. Can someone help?

This one is fun: you can read and reread that sentence and still not figure out what’s going on for a while, or it’ll jump right out at you.

When you rebuild an index, effectively it is dropped and recreated (that’s not entirely true, otherwise, you’d just drop and recreate the index, not rebuild it, but go with me). In doing so, all the data in the table directly related to that index is examined and stored as part of the index. Along with it, a brand new set of statistics is created, based on this full examination of the data, a full scan.

At this point, what you have is a new index, completely defragmented, and a thoroughly up-to-date set of statistics… and then you run sp_updatestats… which will go through the whole table, determining whether any data has changed and the statistics need to be updated. “Any data” means one row. So if even a single row has been modified since the last time the statistics were updated (which you just finished doing), you’re going to get new statistics built. Ah, but how are they built? They’re built by sampling the data. That’s what sp_updatestats does by default.

In short, you built a brand new, fully scanned, set of statistics for your indexes, which should help problematic execution plans, and then you’re replacing some or all of them with a sampled set of statistics which, because they’re less accurate, could hurt performance.
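
In T-SQL terms, the howler looks like this (the index and table are hypothetical); if you’ve just rebuilt, simply skip the second step for that table, and if you need to refresh statistics that the rebuild didn’t touch, target them explicitly:

-- Step 1: rebuild, which also rebuilds the index's statistics with a full scan
ALTER INDEX IX_SalesOrders_CustomerID ON dbo.SalesOrders REBUILD;

-- Step 2: the howler. This re-samples statistics on any table with even one
-- modified row, overwriting the full-scan statistics you just created.
EXEC sp_updatestats;

-- Better: leave the rebuilt statistics alone and update only the column
-- statistics, which the rebuild does not touch
UPDATE STATISTICS dbo.SalesOrders WITH FULLSCAN, COLUMNS;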

The problem is, the Maintenance Plan Wizard doesn’t give you any guidance on this at all. Each of these steps is well described, but there is no indication that the order of the processes matters. There absolutely has to be some kind of warning, like the ones about mixing drugs, to prevent people from making this mistake.

You need to completely separate these two operations. If you’re rebuilding indexes, don’t update those statistics because they’re up to date. If you’re using Maintenance Plans, instead of using the wizard you might need to drill down and build the plans yourself in order to make the manipulation of statistics dependent on the previous step. If you’re interested in an even better way to do this, I’d suggest reading up on Michelle Ufford’s database maintenance scripts.

Our server was running slow, so I started watching the queries with Profiler and now I’m starting to get errors, why would that be?

So you’ve pointed the Profiler GUI at your production environment, which was already under stress, and now things are worse. This is why language is so important.

All over the place, on blogs, in books, in the BOL, the suggestion is that if you need to monitor query performance you should use Profiler to gather that information. And that’s right, as far as it goes. But there is a problem. The Profiler GUI puts a different kind of load on the system to read the information out than you’d get if you set up a server-side trace (a scripted trace, doing the same thing as the GUI, but with T-SQL). Microsoft has it well documented here. But they don’t have it documented at all in the BOL. In fact, here’s the opening description for the Profiler documentation:

Microsoft SQL Server Profiler is a graphical user interface to SQL Trace for monitoring an instance of the Database Engine or Analysis Services. You can capture and save data about each event to a file or table to analyze later. For example, you can monitor a production environment to see which stored procedures are affecting performance by executing too slowly.

For example, you can connect this to production? No, for example, you can’t. For example, you really should use the GUI for browsing results and for combining Performance Monitor data with the trace output data, but both from files. You shouldn’t be pointing the GUI at production at all.
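
Setting up a server-side trace isn’t much more work than the GUI. A minimal sketch, assuming a hypothetical file path, capturing batch and RPC completion with the text, duration and I/O of each statement:

DECLARE @TraceID int;
DECLARE @maxfilesize bigint;
DECLARE @on bit;
SET @maxfilesize = 50;
SET @on = 1;

-- Write the trace to a file on the server, not to the Profiler GUI
EXEC sp_trace_create @TraceID OUTPUT, 0, N'C:\Traces\QueryPerf', @maxfilesize, NULL;

-- SQL:BatchCompleted (event 12) and RPC:Completed (event 10), with
-- TextData (1), SPID (12), Duration (13), Reads (16), Writes (17) and CPU (18)
EXEC sp_trace_setevent @TraceID, 12, 1, @on;
EXEC sp_trace_setevent @TraceID, 12, 12, @on;
EXEC sp_trace_setevent @TraceID, 12, 13, @on;
EXEC sp_trace_setevent @TraceID, 12, 16, @on;
EXEC sp_trace_setevent @TraceID, 12, 17, @on;
EXEC sp_trace_setevent @TraceID, 12, 18, @on;
EXEC sp_trace_setevent @TraceID, 10, 1, @on;
EXEC sp_trace_setevent @TraceID, 10, 12, @on;
EXEC sp_trace_setevent @TraceID, 10, 13, @on;
EXEC sp_trace_setevent @TraceID, 10, 16, @on;
EXEC sp_trace_setevent @TraceID, 10, 17, @on;
EXEC sp_trace_setevent @TraceID, 10, 18, @on;

-- Start it (status 1); use status 0 to stop and status 2 to close and delete it
EXEC sp_trace_setstatus @TraceID, 1;

When you’re done, pull the results back with fn_trace_gettable, or open the file in Profiler; browsing saved trace files is exactly what the GUI is good at.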

Summary

So that’s my collection of howlers. I want to emphasize again that I like the Books Online and use them constantly, all day long. They are an excellent resource for what to do in SQL Server. There are just some gaps that need to be filled in order to avoid these pernicious errors.
