How a type error cost NASA $327 million
By Tomasz Kuczma
On December 11, 1998, NASA launched the Mars Climate Orbiter, a robotic space probe designed to study the Martian climate from orbit and to act as a communications relay for the polar lander launched a few weeks later. Nobody expected that after a nine-month journey it would be lost because of such a simple software mistake.
Hard problems at NASA scale
NASA puts a lot of effort into reliability. Many things can happen in space that we rarely (or never) observe on Earth, e.g.:
- cosmic radiation, which can cause a bit-flip and change the value in a processor register
- failure of one of the core subsystems, like the power supply, an engine or a sensor
Moreover, there are different categories of failures:
- Hard failures - cannot be fixed, e.g. a disk has been physically destroyed and data cannot be read or written
- Soft failures - can be fixed or corrected, e.g. a frame has been lost in transmission but can be resent
- Byzantine failures - you cannot tell whether the element is functioning correctly or not, e.g. a checksum collision (read the article “The Byzantine Generals Problem” by Leslie Lamport, Robert Shostak and Marshall Pease)
Byzantine failures are the worst, because a faulty component cannot be reliably told apart from a healthy one.
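The checksum collision mentioned above can be illustrated with a deliberately weak checksum. The sketch below is only an illustration, not any real spacecraft protocol: `additive_checksum` is a hypothetical helper that sums the bytes modulo 256, so a reordering of bytes goes undetected.

```rust
/// Hypothetical, deliberately weak checksum: the sum of all bytes modulo 256.
fn additive_checksum(data: &[u8]) -> u8 {
    data.iter().fold(0u8, |acc, &b| acc.wrapping_add(b))
}

fn main() {
    let sent: [u8; 3] = [0x01, 0x02, 0x03];
    // Two bytes were swapped in transit, so the payload is corrupted.
    let received: [u8; 3] = [0x03, 0x02, 0x01];

    assert_ne!(sent, received);
    // The checksums still match, so the receiver cannot tell that anything went wrong.
    assert_eq!(additive_checksum(&sent), additive_checksum(&received));
    println!("checksum = {}", additive_checksum(&sent));
}
```

From the receiver's point of view the corrupted frame looks valid, which is what makes this kind of failure so hard to handle.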
You also need to change how software development is managed because:
- Errors cost a lot. You basically have only one shot, and it costs millions - the Mars Climate Orbiter cost $327 million
- It is hard or impossible to make fixes or repairs in space.
- There is no CI/CD or cloud there. You cannot just “push a fix and deploy” or get a new VM. Your software needs to be ready on day 1, like in the old days when software was distributed only on CDs.
As you can see, NASA's problems are harder than what we usually experience by at least an order of magnitude. Nevertheless, they have developed many techniques to deal with them, mostly based on redundancy and intensive testing. The Mars Climate Orbiter was no exception.
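One classic form of redundancy is triple modular redundancy: keep three copies of a value and take a bitwise majority vote, so that a single bit-flip in one copy is outvoted. The sketch below only illustrates that idea; `majority_vote` is a hypothetical helper and not a description of the Orbiter's actual hardware or software.

```rust
/// Bitwise majority vote over three redundant copies of a value.
/// Each output bit is set when at least two of the three inputs have it set.
fn majority_vote(a: u32, b: u32, c: u32) -> u32 {
    (a & b) | (a & c) | (b & c)
}

fn main() {
    let original: u32 = 0b1011_0010;
    // Simulate a radiation-induced bit-flip in one of the three copies.
    let corrupted = original ^ (1 << 5);

    let recovered = majority_vote(original, corrupted, original);
    assert_eq!(recovered, original);
    println!("recovered = {:#010b}", recovered);
}
```

The vote masks one faulty copy; if two copies are corrupted in the same bit, the wrong value wins, which is why redundancy is combined with intensive testing.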
Software error
Still, a “simple” software error snuck into the code. An investigation traced the failure to a mismatch between the units of measurement used by two software systems: metric units (newton seconds) used by NASA and non-metric units (pound-force seconds) used by the spacecraft builder. The discrepancy factor was about 4.45 (1 lbf ≈ 4.448 N), so the effect of thruster firings was underestimated and the spacecraft approached Mars far below the planned altitude of 226 km. Estimates shortly before arrival put the altitude at 150-170 km, and post-failure calculations suggested it was closer to 57 km. As a consequence, the spacecraft was either destroyed in the atmosphere or passed through it and entered an orbit around the Sun. So basically, it was “just” a type error.
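To see where the factor comes from: one pound-force is about 4.448 newtons, so an impulse reported in pound-force seconds but read as newton seconds is underestimated by that factor. The following Rust sketch (Rust is the language used for the examples later in this article) shows the arithmetic. The function name and the sample value are made up for illustration and are not taken from the mission software.

```rust
/// Newtons per pound-force (standard conversion factor).
const NEWTONS_PER_POUND_FORCE: f64 = 4.448_221_615_260_5;

/// Converts an impulse from pound-force seconds to newton seconds.
fn lbf_s_to_n_s(impulse_lbf_s: f64) -> f64 {
    impulse_lbf_s * NEWTONS_PER_POUND_FORCE
}

fn main() {
    // Made-up sample value produced by software working in pound-force seconds.
    let reported: f64 = 10.0;

    // Bug: the bare number is interpreted as newton seconds without conversion.
    let misread_n_s = reported;
    // Correct: convert before use.
    let actual_n_s = lbf_s_to_n_s(reported);

    println!("misread: {} N·s, actual: {:.2} N·s", misread_n_s, actual_n_s);
    println!("discrepancy factor: {:.2}", actual_n_s / misread_n_s);
}
```

Both values are plain `f64`, so nothing stops the unconverted number from being used. That is the gap a stronger type system can close.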
Modern software sees that gap
I’m a compiler fan. I just love compilers because they can detect entire classes of errors before the code is ever run, moving safety and reliability checks ahead of the testing phase. Some of these reliability and safety features have been around in programming for ages, e.g. in Ada (especially the Ada 95 standard). Modern programming languages also see that need and offer features that increase type safety.
The first example is Rust. Its type system and ownership model provide memory safety and thread safety, checked at compile time. That goes well beyond plain type safety, which on its own can be achieved like this:
mod si_unit {
    struct Newton(f64);

    impl Newton {
        fn add(self, other: Newton) -> Newton {
            Newton(self.0 + other.0)
        }
    }
}
I used a tuple struct here to create a new type representing the proper unit. A more idiomatic implementation would be:
mod si_unit {
    use std::ops::Add;

    pub struct Newton(f64);

    impl Newton {
        pub fn new(value: f64) -> Newton {
            Newton(value)
        }
    }

    impl Add for Newton {
        type Output = Self;

        fn add(self, rhs: Self) -> Self {
            Newton(self.0 + rhs.0)
        }
    }
}

// Usage:
use si_unit::Newton;

fn main() {
    let result = Newton::new(2.0) + Newton::new(3.0);
}
Of course, we should also consider the proper value representation for physics: is f64 enough?
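One reason to ask: f64 is a binary floating-point type, so many decimal values are only approximated and rounding errors can accumulate. Here is a minimal sketch of the effect, with a hypothetical `approx_eq` helper and an arbitrarily chosen tolerance:

```rust
/// Hypothetical helper: compares two values within an absolute tolerance.
fn approx_eq(a: f64, b: f64, tolerance: f64) -> bool {
    (a - b).abs() <= tolerance
}

fn main() {
    let sum: f64 = 0.1 + 0.2;

    // 0.1, 0.2 and 0.3 have no exact binary representation, so exact equality fails.
    assert!(sum != 0.3);
    // Comparing within a tolerance (1e-9 is an arbitrary choice here) succeeds.
    assert!(approx_eq(sum, 0.3, 1e-9));
    println!("0.1 + 0.2 = {:.17}", sum);
}
```

Whether that error matters depends on the magnitudes and the number of operations involved, so the representation deserves a deliberate decision.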
The newtype trick prevents adding values measured in Newton to values of other types, e.g. Kilogram, but still allows us to define a set of methods for valid operations, e.g.:
mod si_unit {
    // Previous code here
    use std::ops::Div;

    pub struct Meter2(f64);
    pub struct Pascal(f64);

    impl Meter2 {
        pub fn new(value: f64) -> Meter2 {
            Meter2(value)
        }
    }

    impl Div<Meter2> for Newton {
        type Output = Pascal;

        fn div(self, area: Meter2) -> Self::Output {
            Pascal(self.0 / area.0)
        }
    }
}

// Usage:
use si_unit::Meter2;
use si_unit::Newton;
use si_unit::Pascal;

fn main() {
    let result = Newton::new(3.0) / Meter2::new(2.0); // type is Pascal
}
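The same idea applies to the Orbiter's unit mismatch. In the sketch below, impulse in newton seconds and impulse in pound-force seconds are distinct types, and the only way from one to the other is an explicit conversion. All names (`NewtonSeconds`, `PoundForceSeconds`, `total_impulse`) are hypothetical and chosen for illustration; this is not the mission's code.

```rust
/// Newtons per pound-force (standard conversion factor).
const NEWTONS_PER_POUND_FORCE: f64 = 4.448_221_615_260_5;

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct NewtonSeconds(pub f64);

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct PoundForceSeconds(pub f64);

/// The only bridge between the two units is this explicit conversion.
impl From<PoundForceSeconds> for NewtonSeconds {
    fn from(impulse: PoundForceSeconds) -> Self {
        NewtonSeconds(impulse.0 * NEWTONS_PER_POUND_FORCE)
    }
}

/// Hypothetical consumer that accepts metric impulse only.
pub fn total_impulse(firings: &[NewtonSeconds]) -> NewtonSeconds {
    NewtonSeconds(firings.iter().map(|f| f.0).sum::<f64>())
}

fn main() {
    let reported = PoundForceSeconds(10.0);

    // total_impulse(&[reported]); // rejected by the compiler: mismatched types
    let converted: NewtonSeconds = reported.into();
    let total = total_impulse(&[converted, NewtonSeconds(1.0)]);
    println!("{:?}", total);
}
```

With this design, forgetting the conversion is a compile error rather than a silently wrong number.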
Another example is Kotlin, and I hope Java will follow soon. In both, you can always create a new class to express the proper type, but it comes with a certain performance overhead (allocation and GC). Kotlin supports inline classes (in beta as of now), which can meet our need for performance without introducing the primitive obsession antipattern. There is a good talk about proper typing in Kotlin: “KotlinConf 2019: The Power of Types” by Danny Preussler.
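For comparison, a Rust wrapper type like the ones shown earlier does not pay the allocation cost described here. As a small check, assuming the wrapper is marked `#[repr(transparent)]` (an attribute the earlier examples do not use), its size and alignment match those of a bare f64:

```rust
use std::mem::{align_of, size_of};

/// `repr(transparent)` guarantees the same layout as the single wrapped field.
#[repr(transparent)]
pub struct Newton(f64);

fn main() {
    // The wrapper adds no storage: in memory it is just an f64.
    assert_eq!(size_of::<Newton>(), size_of::<f64>());
    assert_eq!(align_of::<Newton>(), align_of::<f64>());
    println!("Newton occupies {} bytes", size_of::<Newton>());
}
```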
Software engineer with a passion. Interested in computer networks and large-scale distributed computing. He loves to optimize and simplify software on various levels of abstraction starting from memory ordering through non-blocking algorithms up to system design and end-user experience. Geek. Linux user.