Slaying the beast

SQL Compare is the industry standard for comparing and deploying SQL Server database schemas, and is used by 71% of Fortune 100 companies. Like all mature products, however, its code base has been growing for years, with new features added and existing ones updated. The end result? A piece of software described by one developer at Red Gate as “middle-aged, overweight, lazy and heading for trouble”. To resolve the issue, a team was put together with a single mission. Fix the damn thing. Jeff Foster, Head of Software Engineering, told the story of what happened next to technology writer Matt Hilbert. It makes a compelling tale.

It was a cold morning in early 2014. A group of software developers, testers, project managers, UX designers and others were sitting in a meeting room at Red Gate. The atmosphere was part optimism, part gloom, part dread. I’d written two words on the whiteboard. Two words that the onlookers had been worrying about ever since being told they were to work together as a team to fix those two words.

SQL Compare.

It wasn’t the product that was dampening their spirits – the product itself is a terrific piece of software that helps developers compare and deploy database changes quickly, simply, and with zero errors.

No, the product was fine. It was the complex code behind it that was the issue. As everyone in the room knew, it had become impossible to make a change to the code and be confident within a reasonable time frame that it wasn’t a breaking change.

The reason was simple: testing. The tests to ensure changes worked were similarly complex and comprehensive. They were a beast, covering all versions of the SQL Server platform from SQL Server 2005 to SQL Server 2014. If the product broke, the tests failed.

Worse still, it would often take 12 or more hours for the tests to run – and even when they did run, they might be unreliable because of problems like network connection issues. On any given test run there could be 500 random failures, each of which had to be manually reviewed to see whether it was a genuine product regression or simply a transient fault.

What the hell do we do?

That was the question on everyone’s lips. How on earth were we going to refactor the code and make it cleaner, leaner, and simpler, when there was a test problem the size of a Ford F150 pickup truck parked in front of the keyboard?

Our first hypothesis was that the problem was solvable at a team level. The unreliable tests, after all, were a result of coding problems, such as database name clashes and resource management. So our initial plan was to track build and test failures and focus our efforts on the tests that failed the most. A great plan. A simple plan. An optimistic plan.

Thing was, as we started to record information about the builds and tests, we soon realized this was an overly simplistic description of the problem. Even on days when no code was merged, the tests showed a huge variation in build times and the number of failures. The test infrastructure was much more unreliable than we first thought.

We had many options for tackling this problem of test speed and reliability, ranging from fixing one test a day to a complete ground-up rewrite. But do you know what? Rather than going left or right, we went off-field. We went native, if you like.

This calls for an elegant hack

You see, when we were sitting down, having endless discussions over gallons of coffee and a surprisingly large quantity of chocolate digestives, we realized one simple truth.

This was an impossible problem to solve.

It really was. Teams had tried to tackle it before at Red Gate. Great teams of astonishingly intelligent people. And everyone had failed.

Before we could fix the software, we had to fix the tests. Quite simply, we couldn’t afford to wait 12 hours for each test to run. We would be using Zimmer frames before we got anywhere.

We decided we had to isolate the external communication with SQL Server. Bear with me on this one. Without telling anyone of our plan, we stepped through the code and drew architecture diagrams for SQL Compare to find a place where we could inject a seam. All communication between SQL Compare and the outside world (SQL Server) funneled through a single concrete class. That was the link we wanted to sever.

Our plan was to record all conversations with the SQL Server first. When executing the same test again, we would replay those conversations and verify the behavior was the same. By eliminating our dependencies on the outside world, we would have faster and more reliable tests.
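To give a flavor of the idea, here is a minimal sketch of that kind of seam. The names are invented for illustration – they aren't the real SQL Compare classes – and it's written in Java rather than in the product's own code:

```java
// A minimal sketch of a record-replay seam. Names are made up for
// illustration; this is not the real SQL Compare code.
import java.util.List;
import java.util.Map;

// The seam: every conversation with SQL Server funnels through one interface.
interface QueryExecutor {
    List<Map<String, Object>> execute(String sql);
}

// Record mode: pass queries to the live server and keep the answers.
class RecordingExecutor implements QueryExecutor {
    private final QueryExecutor liveServer;
    private final Map<String, List<Map<String, Object>>> tape;

    RecordingExecutor(QueryExecutor liveServer,
                      Map<String, List<Map<String, Object>>> tape) {
        this.liveServer = liveServer;
        this.tape = tape;
    }

    @Override
    public List<Map<String, Object>> execute(String sql) {
        List<Map<String, Object>> rows = liveServer.execute(sql);
        tape.put(sql, rows);   // remember the conversation for later replay
        return rows;
    }
}

// Replay mode: answer from the recording – no network, no SQL Server.
class ReplayExecutor implements QueryExecutor {
    private final Map<String, List<Map<String, Object>>> tape;

    ReplayExecutor(Map<String, List<Map<String, Object>>> tape) {
        this.tape = tape;
    }

    @Override
    public List<Map<String, Object>> execute(String sql) {
        List<Map<String, Object>> rows = tape.get(sql);
        if (rows == null) {
            throw new IllegalStateException(
                "Query was not in the recording – behavior has changed: " + sql);
        }
        return rows;
    }
}
```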

A great plan – an elegant hack that might resolve that testing issue. Might. Now we had to convince everyone.

Time to face the music

I guess you can imagine the scene. Me sitting in a small room with two not very happy bosses, telling them that it would take three months. Not to fix the software and get new features out to customers. Just to fix the testing problem.

I explained our record-replay idea and told them that, in order to insert the record-replay interface, we would have to painstakingly unpick the communication between each side of the boundary.

Oh, and I also mentioned that the whole team would be having Thursday afternoons off to watch DVDs about programming, discuss different methods of programming, and shoot the breeze about Extreme Programming, Mob Programming, and Agile methodologies. To make us a better team, more able to do our job, I said.

I left the room pretty quickly.

Are you sure this is going to work?

Feeling rather nervous, not even sure whether we were mad, misguided or maverick geniuses, we set to work.

SQL Compare has about 300 test fixtures comprising nearly 10,000 tests. Each test follows the same general pattern: use two databases and perform some synchronization between them. The existing style was imperative, which meant each individual test was responsible both for the test logic itself and for creating its own test environment. That made it hard for us to centralize the handling of the record-replay part of the code.

We decided to layer a fluent API over the existing tests and specify the intent of each test declaratively. With all those test fixtures, this was a big commitment, but we felt that extracting the test environment creation was worth it. We rewrote the test fixtures mechanically to centralize all the test setup, which let us insert the record-replay logic in just a few places in the code. We used memoization to replay query results from the server, on the assumption that any SQL we generated would remain consistent.
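To give a sense of the shift from imperative to declarative, here is a hypothetical sketch of what such a fluent test layer can look like. Again, the names are made up for illustration rather than taken from the real SQL Compare test framework:

```java
// Hypothetical fluent test layer – illustrative only. The test states intent;
// environment creation (and the record-replay plumbing) lives behind the
// builder, in one central place.
class ComparisonScenario {
    private String sourceScript;
    private String targetScript;

    static ComparisonScenario given() {
        return new ComparisonScenario();
    }

    ComparisonScenario sourceDatabase(String createScript) {
        this.sourceScript = createScript;
        return this;
    }

    ComparisonScenario targetDatabase(String createScript) {
        this.targetScript = createScript;
        return this;
    }

    void expectSynchronizationScriptContaining(String fragment) {
        // In a real framework this is where the two databases would be created
        // (or replayed from a recording), the comparison run, and the generated
        // SQL checked. Elided in this sketch.
    }
}

// A test then reads as a declarative statement of intent:
class AddColumnExample {
    void missingColumnProducesAnAlter() {
        ComparisonScenario.given()
            .sourceDatabase("CREATE TABLE t (id INT, name NVARCHAR(50))")
            .targetDatabase("CREATE TABLE t (id INT)")
            .expectSynchronizationScriptContaining("ALTER TABLE");
    }
}
```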

Once the memoization was complete, we no longer needed to restore a database. Even better, we no longer needed to write to a database when synchronizing. If SQL Compare generated the same SQL and applied that to the same database then we could confidently stop there, knowing that (as long as our interface to SQL Server hadn’t changed) behavior would be preserved.
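In sketch form, the check at the end of a replayed test boils down to a comparison of scripts rather than a round trip to a server (illustrative names again):

```java
// Illustrative only: assert that the newly generated script matches the one
// captured during the recorded run, instead of applying it to a database.
class SynchronizationCheck {
    static void assertUnchanged(String recordedScript, String generatedScript) {
        if (!recordedScript.equals(generatedScript)) {
            throw new AssertionError(
                "Generated SQL differs from the recorded run – possible regression:\n"
                    + generatedScript);
        }
        // Same SQL against the same recorded starting state implies the same
        // end state, so there is no need to execute anything at all.
    }
}
```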

We weren’t communicating over the network with a SQL Server any more. We weren’t restoring backups or setting up databases. We weren’t affected by network outages or backups failing to restore. Our tests were a lot more reliable because they failed for the right reasons, rather than because of things beyond our control.

Best of all, by the end of those three months, we reduced the number of failing tests from 500+ to zero, and brought the cycle time down to under an hour.

That’s quite something. And we hadn’t started really coding yet.

Now let’s code

With the test issue resolved, we could get at that “middle-aged, overweight” code, confident that we could work on it without being constantly held up.

To kickstart the process, we held a number of whiteboard sketching sessions where we imagined what we’d like SQL Compare to look like. We came up with a simple breakdown of components and, perhaps more importantly, a common language for talking about the product.

We used simple architecture sketches to capture our decisions and ended up with a block diagram describing the components and the responsibilities of each area. This wasn’t reflected in the code yet, but it gave us a starting point for code reviews to make sure that functionality was added in the appropriate place.

Then we got down to the fun stuff. We initially paired, trying to turn a complex refactoring task into a series of simpler ones. Once we’d demonstrated this refactoring plan worked, we split up and worked on separate areas of the code base.

With the aid of those fast-running tests, we were able to recover rapidly from failures. When a test failed, everyone stopped work until it was fixed. Everyone. The attitude became one of not leaving anything behind that might trip us up in the future.

It ain’t what you do, it’s the way that you do it

The great thing about coding in a dedicated team is that you can choose the best coding method for the task at hand. A lot of the time, we pair-programmed to produce the highest quality code we could. When we hit a particularly difficult problem in a small portion of code, we mob-programmed. We all sat in the same room with one keyboard and one screen, and we argued, discussed, and fought over the problem until it was resolved.

We used Agile methodologies, with short sprints. We introduced “The Three Amigos” thinking to create a common understanding and shared vocabulary, and define how new features should work. We adopted XP practices to write code faster. On Thursday afternoons we held group exercises, where we would watch a DVD on how to write a good test, for example, and then discuss it.

The day I knew it was really working was when I overheard someone say: “That’s not good enough for Compare programming”. It became a mantra in the team. If your code wasn’t good enough, own up and do it again.

The project culminated in the release of SQL Compare 11 in October 2014. That was the headline. Behind it, we had refactored the code to make it leaner, simpler, and nicer to work with. We were writing code test-first and had increased the number of unit tests from 1,200 to 1,800. We had fixed 142 reported bugs. We had added a host of new features to support SQL Server 2014. We had further reduced the integration testing time to just 15 minutes.

We had slain the beast.

I’ll leave the last words for Sam Blackburn, software engineer, because they echo the view of the whole team: “I was optimistic when we started, but I honestly didn’t think we could do it. I thought we’d make a couple of dents and improve it slightly. But do you know what? We did something that seemed impossible. And we had a lot of fun on the way.”

Jeff Foster slew the SQL Compare testing beast with a team from Red Gate that included Andrea Angella, Emma Armstrong, Sam Blackburn, Amy Burrows, Reka Burmeister, David Connell, Tom Crossman, Alasdair Daw, Anuradha Deshpande, Alice Easey, Chris George, James Gilmore, Chris Hurley, Mark Jordan, Evan Moss, Adam Parker, Tom Russell, Dom Smith, Michelle Taylor, ChengVoon Tong, Alexandra Turner, Jonathan Watts, and last but definitely not least, David You. Work continues on SQL Compare today to add more features and make it ever more ingeniously simple.