For every big project I’ve worked on as an engineer, there came a crisis. It would happen somewhere along the design path, generally about halfway through the schedule, and it had the potential to wreck the project. This was always unpleasant for the design team, not only because it threatened the project but also because the competition was probably that much further along. Managers might start pointing fingers at one another, and “stress fractures” in the team would threaten to bring the whole project down, even if the underlying issue was minor.
I first saw that happen at Bell Labs in 1979, on the Bellmac microprocessors. A chief architect had drawn up an elaborate blueprint for a new chip and then parceled out the work to many groups, each of which had to work concurrently to devise suitable circuitry for its part of the design. A few months later, all of these parcels had to be integrated for the first time.
Everything worked perfectly. Oops, no, wait, that was just the plan.
In reality, as everyone approached the moment of truth, all of the groups started sending their spies around to see which group was in the biggest danger of missing the deadline. As long as some other group was worse off, everyone else felt better. This caused a peculiar reverse motivation: instead of helping the groups that were worst off, the groups that were doing best sat on their hands. No one wanted to risk becoming the new laggard.
At Intel, an enlightened manager (no, not me) saw this problem looming on the P6 chip design and took pre-emptive action. Instead of simply punishing the apparent “weak link” among the groups, this manager rewarded ahead-of-schedule groups that identified and helped behind-schedule groups. We still didn’t always make the schedule, but at least everyone was pulling in the same direction. And, while no one wanted to get sent to the back of the line to help the stragglers, this manager made sure everyone knew we stood or fell together, and our mutual fortunes would be made on that basis.
But besides schedule crises, there were also technical crises. Always. It’s curious that technical crises always happen. What’s equally curious is how useful they tend to be. It’s in the technical-crisis mode itself that some of the best ideas tend to emerge. Something about the extreme urgency focuses everyone’s minds jointly, and superb insight bubbles up from the combined brains.
Why must these crises happen? It’s a combination of things, but one factor is that you never have perfect knowledge. Not ever. You don’t know the actual status of the project, because it’s changing all the time. So many variables are in play that you cannot possibly gauge them all at once. Imagine that you’re trying to build a house, and one team is handling the windows, a second team is handling the floors, a third is dealing with wiring, and a fourth is dealing with painting. But if the floors have to be changed to oak from maple, then the shade of paint has to change, or if the preferred style of window is out of stock, then perhaps the wiring is affected. And that’s a simple example, because we already know what a house is. Imagine if you didn’t really know what a house was, or had only a rough idea of what a house might be. That’s what new technology projects are often like.
With the business of computer chips, no plan for troubleshooting and testing the product can ever be truly comprehensive, because there are so many combinations in play. (In the first couple of seconds of operation of a new chip, you will have run more cycles than were simulated over the previous five years of design.) So as you go, you resolve the issues as you can. Still, even if you’re really good and really lucky and fix most of the problems, some small but crucial ones might remain. These are seemingly minor issues that take on weight and become chronic, like a ball and chain wrapped around the project’s ankle. Eventually, these chronic issues may force a crisis, or at least a difficult choice between two unappealing alternatives.
In developing Intel’s P6 chip (ultimately known as the Pentium Pro) in the early 1990s, we eventually reached a point in the process when it started to look like our design was off. To be specific, for those who are more technically minded, our estimates of the silicon die size (the area of the chip, its length times its width) were showing us that the dimensions for the chip we wanted to produce were too big. Not so big as to make it impossible to produce, but big enough to go against some other priorities and throw other efforts off balance. In the same way that a dull toothache eventually forces you to see your dentist, this problem eventually pained the senior chip architects enough to cause them to grapple with it head-on. We knew that if we procrastinated any longer, the smaller decisions we were making daily could all become moot.
One night, while traveling, we all came clean at a Chinese restaurant in Santa Clara, California. We admitted to each other that the die-size problem had become chronic, with potentially dire implications should we fail to resolve it, and we got serious. Over General Tso’s chicken, we took a napkin and started drawing possible remedies. Finally, one of us came up with the winning idea. He pointed out that the chip, as originally envisioned, had two separate structures, a design feature we’d come up with as a way to disentangle some complex functions. But at this point in the project, we felt confident that we could handle the complexity associated with combining these two structures into one. So we did. By the time we were eating our fortune cookies, we had a workable plan that got the die size under control, didn’t sacrifice too much performance, and would not generate whole new families of design bugs. Intel should have mounted and framed that napkin, because the basic plan it described has been the basis for all of Intel’s x86 chips since 1993.
My takeaway from this and other similar situations is this: Even well-managed projects will encounter crises that, if not resolved expeditiously and correctly, can scuttle the project or cause it to end in mediocre products. If you hit such a crisis, remember the advice from The Hitchhiker’s Guide to the Galaxy: don’t panic. Don’t waste time looking for scapegoats; remind everyone that they all sink or swim together. Collect your most senior technologists, take them to a Chinese restaurant, and give them as many napkins as it takes to work through the issues.