Friday, May 13, 2011

5 Whys, Harder than it looks, Part 1: What's a Failure?

An exercise I gave my software project management class at Northwestern this year was to do a 5 Whys process improvement analysis, as in this video and essay from Eric Ries. I also strongly recommend Tony Ford's experience report and example analysis.

There were three parts to the exercise:
  • identifying some key failures
  • developing a 5 Whys causal chain for each failure
  • proposing process improvements to address each step in each chain
I showed the Ries video, assigned the Ries essay, and then had them do similar analyses on several failures in the project they had just finished. I did these as critiqued exercises, figuring that it would take a few iterations to get right. I was right about the need for iterations, but way off on the "few" part. Each part has been quite hard for most of the students.

First up:

What is a 5YA failure?

A project has lots of failures, but only some are appropriate for a 5 Whys analysis. Both Ries and Ford use examples involving some form of web site failure. Here are some of the things my students submitted and the critiques they got. There were many similar examples of each type of problem.
  • Our team had to implement image upload three times
I classify ones like this as "bad choices led to rework." Clearly something you'd like to avoid in the future, but is it a 5YA (5 Why Appropriate) failure? I say no, because there's no user story failure. A 5YA failure is a user failure, like "the web site crashed" or "the search function returned 'page not found' errors." Without knowing what user stories were impacted and how (broken, missing, slow, ...), you have no idea how to prioritize the importance of the failure and therefore how much effort to expend in trying to avoid it in the future. Furthermore, making things run more smoothly is much less motivating than avoiding embarrassing public failures.
  • Our client changed the design of the website twice
There were a lot of these "requirement changes led to rework." This has three fundamental things wrong with it. The first problem is the same as the previous example. There's no specific user story impact. Second, this is not a developer failure. You can't fix what you can't control. What you need to fix is how your process handles the things you don't control. It's not a 5YA failure if your cloud service dies. It is a 5YA failure if you have no backup. Third, and most important, this may not even be a failure at all. Iterating on designs is a good thing, if driven by actual usage. The whole point of agile is that it's impossible to get it all correct up front. The corollary is that sometimes we'll get it wrong. That's not a failure. That's life.
  • XXX was working on a fork of the repository so almost none of his code got integrated
I leave what's wrong here as an exercise for the reader. It's one of the points already made.
  • The code XXX submitted for "index" view of the "projects" controller didn't work
  • Buggy and totally non-working code was checked in
These are just way too broad and non-specific. "Didn't work" doesn't distinguish "nothing happened" from "wrong results" from "error page appeared." The devil is in the details. Ignore those details and the devil remains. Trying to fix the problem "code is buggy" leads to ineffective measures like "test more!" and expensive ineffective measures like "everyone goes to Java camp this summer!" How the user story actually breaks leads to very different analyses, responses and priorities.

A 5YA failure is a bug report

My final recommendation for my class was that a 5YA failure should be writable as a bug report. That is, it should be put in the form "when a [user-type] does [action], [failure event] occurs." Failure events should distinguish between nothing happening, the wrong things being returned, errors being returned, and so on.
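
To make the bug-report form concrete, here is a minimal Python sketch of one way it could be written down as a record. The names BugReport and FailureEvent, the three failure kinds, and the sample "visitor" report are all my own illustrative assumptions, not something from the class or from Ries or Ford.

```python
from dataclasses import dataclass
from enum import Enum


class FailureEvent(Enum):
    """Hypothetical failure kinds; extend with "slow", "missing", and so on."""
    NOTHING_HAPPENS = "nothing happens"
    WRONG_RESULT = "the wrong result is returned"
    ERROR_RETURNED = "an error is returned"


@dataclass(frozen=True)
class BugReport:
    """A failure stated as: when a [user-type] does [action], [failure event] occurs."""
    user_type: str               # who is affected, e.g. "visitor"
    action: str                  # verb phrase for what they did
    failure_event: FailureEvent  # what went wrong, from the user's point of view

    def summary(self) -> str:
        # Render the report as a single sentence in the template form.
        return f"When a {self.user_type} {self.action}, {self.failure_event.value}."


# Illustrative report, loosely based on the search example earlier in the post.
report = BugReport(
    user_type="visitor",
    action="uses the search function",
    failure_event=FailureEvent.ERROR_RETURNED,
)
print(report.summary())
```

A submission like "buggy code was checked in" cannot be filled into this record: it has no user type, no action, and no failure event, which is a quick way to spot that it is not yet a 5YA failure.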