What can software engineers learn from NASA’s Mission Control Center

Apr 3, 2024
Edmundo Ortega
Flight controllers in the 1960s at NASA's Mission Control Center in Houston

In the 1960s, NASA committed to an audacious goal: sending humans to the moon. This project was made exponentially more ambitious by the feature requirement that those humans should also return alive.

It was quickly understood that the complexity of getting a capsule into orbit and beyond was going to require more coordination than a few dudes with binoculars and slide rules. 

So the Mission Control Center (MCC) was constructed in Houston to oversee, monitor, and coordinate the complexity of sending people off the planet. To make a long and obvious story short, NASA met their goal, in no small part thanks to the philosophy of MCC. What can software engineering leaders learn from this approach, which has continued to evolve and improve over the last half century? I would argue that it’s not about any one thing—in fact, we are already doing many of the things pioneered by NASA—it’s the totality of the approach, centered around a single, powerful idea: Situational Awareness.

NASA realized that mission success would rely on the coordination of hundreds of people across dozens of specialties. They needed to allow individual contributors to make independent decisions while keeping managers and higher-level decision makers informed at all times. When you see the classic photos of mission control, what you’re looking at is a slew of real-time information that everyone across the org could see. This was because strict weight requirements didn’t allow for much redundancy in individual systems—each system needed to be coordinated so that it could act as a makeshift redundancy when needed. To hoard information would hamper those efforts. Comfort with fluid goal reprioritization, constant multi-directional communication, and collaborative problem-solving were radical innovations at the time, and they are still difficult to achieve at most companies. How did NASA pull it off?

Real-time monitoring and support

The Mercury and Apollo spacecraft were thoroughly instrumented. If something went wrong, flight controllers needed to know immediately, and they needed to know why. It wasn’t enough to know that the craft was off course; they needed to know that a failure had occurred in a specific XXX nozzle, and they needed to know right away. Then they could examine the other nozzles to learn which indicators might precede future failures.

Companies like Datadog have revolutionized the instrumentation of deployed code. But what about during the development process? When things go wrong during development, they can have a long-lasting impact on product readiness and quality. Currently, the state of the art in monitoring the SDLC is basically measuring throughput metrics like cycle time and MTTR. Those are good to know, but they don’t tell you much about what’s really going on. They tell you that the craft is off course, but they don’t tell you why.
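To see how little these throughput numbers actually encode, here is a minimal sketch of how cycle time and MTTR are typically computed. The record fields (`started`, `finished`, `detected`, `resolved`) are hypothetical, not any particular tool’s schema:

```python
from datetime import datetime
from statistics import mean

# Hypothetical issue records: when work started and finished, and (for
# incidents) when a failure was detected and resolved.
issues = [
    {"started": datetime(2024, 3, 1), "finished": datetime(2024, 3, 5)},
    {"started": datetime(2024, 3, 2), "finished": datetime(2024, 3, 10)},
]
incidents = [
    {"detected": datetime(2024, 3, 4, 9, 0), "resolved": datetime(2024, 3, 4, 12, 30)},
]

# Cycle time: mean elapsed time from start to finish, in days.
cycle_time_days = mean(
    (i["finished"] - i["started"]).total_seconds() / 86400 for i in issues
)

# MTTR: mean time to recovery, in hours.
mttr_hours = mean(
    (i["resolved"] - i["detected"]).total_seconds() / 3600 for i in incidents
)

print(f"cycle time: {cycle_time_days:.1f} days, MTTR: {mttr_hours:.1f} hours")
```

The averages fall out easily enough, but nothing in them explains why the second issue took twice as long as the first. That gap is exactly the point.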

Collective decision-making and problem-solving

When unexpected situations or emergencies arise on a mission, the combined expertise of the flight controllers, engineers, and scientists within MCC becomes instrumental in developing and implementing solutions. The famously recounted Apollo 13 mission is a classic example of how mission control played a vital role in safely returning the crew to Earth after an in-space explosion.

In software engineering, we spend a lot of time communicating status. Each developer works in their own little bubble and must relay their progress, problems, and priorities to their boss to keep them in the loop. Information flows bottom-up, and rarely the other way. That uneven distribution of information leads to a lack of autonomy, poor decision-making, knowledge silos, and knowledge gaps.

And siloed data is hard to access unless you are the resident expert who knows how to get at it. If you want a complete picture, say, to assess the state of quality or maintainability of a system, you need multiple data analysts and a month just to answer basic questions. With mission control and your data combined, every person is an analyst, and every decision maker can be informed. This improves situational awareness and decision-making.

Active Risk Management

At NASA, the cost of failure is extraordinarily high: unfathomable sums of money, human lives, national pride. At the same time, spacecraft, with their tight weight requirements, don’t have the luxury of excessive redundancy. So proactively addressing risk is the name of the game. Mission Control is constantly paying attention to early indicators of possible failure, from electrical and mechanical systems to astronaut biology. It’s not about measuring performance; it’s about knowing when an out-of-bounds datapoint occurs, across millions of measurements, and understanding the size and significance of that outlier.

The tools we use in software development are capturing tons of data. But that data never gets examined holistically. The best we can do at the moment is to formulate a handful of throughput metrics like cycle time and MTTR. Those tell us the what, but not the why. If we could measure risk factors like lack of focus time, tickets with unanswered questions, and knowledge silos, we could see, understand, and cut off problems before they become dumpster fires.

Continuous Improvement

The learning curve for sending people into space was steep, which meant that every mission was an incredible opportunity for self-evaluation and improvement. But the learning didn’t just happen at the end of every mission—it was constant. As unexpected events occurred (and let’s be honest, when don’t they?), the team was collectively working to adapt and develop innovative new approaches on the fly. The complexity of the missions meant these changes involved process, code, design, materials, preparation… you name it.  

Continuous improvement is a buzzword in software engineering. We do retros, right? But retros are mostly hearsay. What if we could see what really went right and wrong, and how that affected our ability to deliver? The truth is that it’s possible, but the time and effort required to really dig into the data just isn’t worth it. I’m not afraid to say that much of the retrospective analysis teams do today is actually theater—going through the motions without making a significant impact on future operations.

Hats off to the OGs

You gotta give it up to those crazy dreamers in the 1960s who took on an audacious goal and actually achieved it. Can we walk in their footsteps to achieve the next great CRM, social media app, or self-driving school bus?

NASA had basically unlimited funds and a near-moral imperative to succeed. What they lacked in technology, they made up for in wherewithal. Can we ordinary, everyday developers adopt some of their innovations? We think so. That’s why we built VZBL—it’s basically mission control for software teams. We think you need more data, more real-time awareness, more transparency, and more understanding of what’s happening in your mission to deliver product.