Safer, Not Later

How “Move Fast and Break Things” ruined the world by escaping the context that it was intended for.

Facebook — and by extension, most of Silicon Valley — rightly gets a lot of shit for its old motto, “Move Fast and Break Things”.

As a general principle for living your life, it is obviously terrible advice, and it leads to a lot of the horrific outcomes of Facebook’s business.

I don’t want to be an apologist for Facebook. I also do not want to excuse the worldview that leads to those kinds of outcomes. However, I do want to try to help laypeople understand how software engineers—particularly those situated at the point in history where this motto became popular—actually meant by it. I would like more people in the general public to understand why, to engineers, it was supposed to mean roughly the same thing as Facebook’s newer, goofier-sounding “Move fast with stable infrastructure”.

Move Slow

In the bad old days, circa 2005, two worlds within the software industry were colliding.

The old world was the world of integrated hardware/software companies, like IBM and Apple, and shrink-wrapped software companies like Microsoft and WordPerfect. The new world was software-as-a-service companies like Google, and, yes, Facebook.

In the old world, you delivered software in a physical, shrink-wrapped box, on a yearly release cycle. If you were really aggressive you might ship updates as often as quarterly, but faster than that and your physical shipping infrastructure would not be able to keep pace with new versions. As such, development could proceed in long phases based on those schedules.

In practice what this meant was that in the old world, when development began on a new version, programmers would go absolutely wild adding incredibly buggy, experimental code to see what sorts of things might be possible in a new version, then slowly transition to less coding and more testing, eventually settling into a testing and bug-fixing mode in the last few months before the release.

This is where the idea of “alpha” (development testing) and “beta” (user testing) versions came from. Software in that initial surge of unstable development was extremely likely to malfunction or even crash. Everyone understood that. How could it be otherwise? In an alpha test, the engineers hadn’t even started bug-fixing yet!

In the new world, the idea of a 6-month-long “beta test” was incoherent. If your software was a website, you shipped it to users every time they hit “refresh”. The software was running 24/7, on hardware that you controlled. You could be adding features at every minute of every day. And, now that this was possible, you needed to be adding those features, or your users would get bored and leave for your competitors, who would do it.

But this came along with a new attitude towards quality and reliability. If you needed to ship a feature within 24 hours, you couldn’t write a buggy version that crashed all the time, see how your carefully-selected group of users used it, collect crash reports, fix all the bugs, have a feature-freeze and do nothing but fix bugs for a few months. You needed to be able to ship a stable version of your software on Monday and then have another stable version on Tuesday.

To support this novel sort of development workflow, the industry developed new technologies. I am tempted to tell you about them all. Unit testing, continuous integration servers, error telemetry, system monitoring dashboards, feature flags... this is where a lot of my personal expertise lies. I was very much on the front lines of the “new world” in this conflict, trying to move companies to shorter and shorter development cycles, and move away from the legacy worldview of Big Release Day engineering.

Old habits die hard, though. Most engineers at this point were trained in a world where they had months of continuous quality assurance processes after writing their first rough draft. Such engineers feel understandably nervous about being required to ship their probably-buggy code to paying customers every day. So they would try to slow things down.

Of course, when one is deploying all the time, all other things being equal, it’s easy to ship a show-stopping bug to customers. Organizations would do this, and they’d get burned. And when they’d get burned, they would introduce Processes to slow things down. Some of these would look like:

Let’s keep a special version of our code set aside for testing, and then we’ll test that for a few weeks before sending it to users.
The heads of every department need to sign-off on every deployed version, so everyone needs to spend a day writing up an explanation of their changes.
QA should sign off too, so let’s have an extensive sign-off process where each individual tester does a fills out a sign-off form.

Then there’s my favorite version of this pattern, where management decides that deploys are inherently dangerous, and everyone should probably just stop doing them. It typically proceeds in stages:

Let’s have a deploy freeze, and not deploy on Fridays; don’t want to mess up the weekend debugging an outage.
Actually, let’s extend that freeze for all of December, we don’t want to mess up the holiday shopping season.
Actually why not have the freeze extend into the end of November? Don’t want to mess with Thanksgiving and the Black Friday weekend.
Some of our customers are in India, and Diwali’s also a big deal. Why not extend the freeze from the end of October?
But, come to think of it, we do a fair amount of seasonal sales for Halloween too. How about no deployments from October 10 onward?
You know what, sometimes people like to use our shop for Valentine’s day too. Let’s just never deploy again.

This same anti-pattern can repeat itself with an endlessly proliferating list of “environments”, whose main role ends up being to ensure that no code ever makes it to actual users.

… and break things anyway

As you may have begun to suspect, there are a few problems with this style of software development.

Even back in the bad old days of the 90s when you had to ship disks in boxes, this methodology contained within itself the seeds of its own destruction. As Joel Spolsky memorably put it, Microsoft discovered that this idea that you could introduce a ton of bugs and then just fix them later came along with some massive disadvantages:

The very first version of Microsoft Word for Windows was considered a “death march” project. It took forever. It kept slipping. The whole team was working ridiculous hours, the project was delayed again, and again, and again, and the stress was incredible. [...] The story goes that one programmer, who had to write the code to calculate the height of a line of text, simply wrote “return 12;” and waited for the bug report to come in [...]. The schedule was merely a checklist of features waiting to be turned into bugs. In the post-mortem, this was referred to as “infinite defects methodology”.

Which lead them to what is perhaps the most ironclad law of software engineering:

In general, the longer you wait before fixing a bug, the costlier (in time and money) it is to fix.

A corollary to this is that the longer you wait to discover a bug, the costlier it is to fix.

Some bugs can be found by code review. So you should do code review. Some bugs can be found by automated tests. So you should do automated testing. Some bugs will be found by monitoring dashboards, so you should have monitoring dashboards.

So why not move fast?

But here is where Facebook’s old motto comes in to play. All of those principles above are true, but here are two more things that are true:

No matter how much code review, automated testing, and monitoring you have some bugs can only be found by users interacting with your software.
No bugs can be found merely by slowing down and putting the deploy off another day.

Once you have made the process of releasing software to users sufficiently safe that the potential damage of any given deployment can be reliably limited, it is always best to release your changes to users as quickly as possible.

More importantly, as an engineer, you will naturally have an inherent fear of breaking things. If you make no changes, you cannot be blamed for whatever goes wrong. Particularly if you grew up in the Old World, there is an ever-present temptation to slow down, to avoid shipping, to hold back your changes, just in case.

You will want to move slow, to avoid breaking things. Better to do nothing, to be useless, than to do harm.

For all its faults as an organization, Facebook did, and does, have some excellent infrastructure to avoid breaking their software systems in response to features being deployed to production. In that sense, they’d already done the work to avoid the “harm” of an individual engineer’s changes. If future work needed to be performed to increase safety, then that work should be done by the infrastructure team to make things safer, not by every other engineer slowing down.

The problem is that slowing down is not actually value neutral. To quote myself here:

If you can’t ship a feature, you can’t fix a bug.

When you slow down just for the sake of slowing down, you create more problems.

The first problem that you create is smashing together far too many changes at once.

You’ve got a development team. Every engineer on that team is adding features at some rate. You want them to be doing that work. Necessarily, they’re all integrating them into the codebase to be deployed whenever the next deployment happens.

If a problem occurs with one of those changes, and you want to quickly know which change caused that problem, ideally you want to compare two versions of the software with the smallest number of changes possible between them. Ideally, every individual change would be released on its own, so you can see differences in behavior between versions which contain one change each, not a gigantic avalanche of changes where any one of hundred different features might be the culprit.

If you slow down for the sake of slowing down, you also create a process that cannot respond to failures of the existing code.

I’ve been writing thus far as if a system in a steady state is inherently fine, and each change carries the possibility of benefit but also the risk of failure. This is not always true. Changes don’t just occur in your software. They can happen in the world as well, and your software needs to be able to respond to them.

Back to that holiday shopping season example from earlier: if your deploy freeze prevents all deployments during the holiday season to prevent breakages, what happens when your small but growing e-commerce site encounters a catastrophic bug that has always been there, but only occurs when you have more than 10,000 concurrent users. The breakage is coming from new, never before seen levels of traffic. The breakage is coming from your success, not your code. You’d better be able to ship a fix for that bug real fast, because your only other option to a fast turn-around bug-fix is shutting down the site entirely.

And if you see this failure for the first time on Black Friday, that is not the moment where you want to suddenly develop a new process for deploying on Friday. The only way to ensure that shipping that fix is easy is to ensure that shipping any fix is easy. That it’s a thing your whole team does quickly, all the time.

The motto “Move Fast And Break Things” caught on with a lot of the rest of Silicon Valley because we are all familiar with this toxic, paralyzing fear.

After we have the safety mechanisms in place to make changes as safe as they can be, we just need to push through it, and accept that things might break, but that’s OK.

Some Important Words are Missing

The motto has an implicit preamble, “Once you have done the work to make broken things safe enough, then you should move fast and break things”.

When you are in a conflict about whether to “go fast” or “go slow”, the motto is not supposed to be telling you that the answer is an unqualified “GOTTA GO FAST”. Rather, it is an exhortation to take a beat and to go through a process of interrogating your motivation for slowing down. There are three possible things that a person saying “slow down” could mean about making a change:

It is broken in a way you already understand. If this is the problem, then you should not make the change, because you know it’s not ready. If you already know it’s broken, then the change simply isn’t done. Finish the work, and ship it to users when it’s finished.
It is risky in a way that you don’t have a way to defend against. As far as you know, the change works, but there’s a risk embedded in it that you don’t have any safety tools to deal with. If this is the issue, then what you should do is pause working on this change, and build the safety first.
It is making you nervous in a way you can’t articulate. If you can’t describe an known defect as in point 1, and you can’t outline an improved safety control as in step 2, then this is the time to let go, accept that you might break something, and move fast.

The implied context for “move fast and break things” is only in that third condition. If you’ve already built all the infrastructure that you can think of to build, and you’ve already fixed all the bugs in the change that you need to fix, any further delay will not serve you, do not have any further delays.

Unfortunately, as you probably already know,

This motto did a lot of good in its appropriate context, at its appropriate time. It’s still a useful heuristic for engineers, if the appropriate context is generally understood within the conversation where it is used.

However, it has clearly been taken to mean a lot of significantly more damaging things.

Purely from an engineering perspective, it has been reasonably successful. It’s less and less common to see people in the industry pushing back against tight deployment cycles. It’s also less common to see the basic safety mechanisms (version control, continuous integration, unit testing) get ignored. And many ex-Facebook engineers have used this motto very clearly under the understanding I’ve described here.

Even in the narrow domain of software engineering it is misused. I’ve seen it used to argue a project didn’t need tests; that a deploy could be forced through a safety process; that users did not need to be informed of a change that could potentially impact them personally.

Outside that domain, it’s far worse. It’s generally understood to mean that no safety mechanisms are required at all, that any change a software company wants to make is inherently justified because it’s OK to “move fast”. You can see this interpretation in the way that it has leaked out of Facebook’s engineering culture and suffused its entire management strategy, blundering through market after market and issue after issue, making catastrophic mistakes, making a perfunctory apology and moving on to the next massive harm.

In the decade since it has been retired as Facebook’s official motto, it has been used to defend some truly horrific abuses within the tech industry. You only need to visit the orange website to see it still being used this way.

Even at its best, “move fast and break things” is an engineering heuristic, it is not an ethical principle. Even within the context I’ve described, it’s only okay to move fast and break things. It is never okay to move fast and harm people.

So, while I do think that it is broadly misunderstood by the public, it’s still not a thing I’d ever say again. Instead, I propose this:

Make it safer, don’t make it later.

Acknowledgments

Thank you to my patrons who are supporting my writing on this blog. If you like what you’ve read here and you’d like to read more of it, or you’d like to support my various open-source endeavors, you can support my work as a sponsor! I am also available for consulting work if you think your organization could benefit from expertise on topics like “how do I make changes to my codebase, but, like, good ones”.

Ungineering

Don’t use the word “engineering” to refer to the process of creating software.

software creativity engineering Friday September 19, 2014

Update 2021: While I still stand by many of the ideas expressed in this essay — particularly “software is made out of feelings” — my views have been significantly changed by two follow-ups. If you're interested in this topic, you should read them both; they’ll teach you more than this will.

The first, “Reverse Ungineering”, by LVH, was a direct response to my post, based on personal experience being trained as a civil engineer and working as a software engineer. Reverse Ungineering changed my opinion almost immediately, so I actually held the view expressed in the summary for a very short period of time after publishing.

The second, “Are We Really Engineers?”, by Hillel Wayne, is a small but comprehensive ethnographic study of people who have done both jobs. It’s extremely eye-opening, and made me realize just how much of my idea of “engineering” was derived from a mixture of fiction and popular culture, and not at all on any reality.

Both of these posts bring to bear informative facts based on direct personal experience, as opposed to my unsubstantiated hypothesizing. While I often still call myself a software “developer” or “author”, and I think that comparisons to fields like writing and research can also be illuminating, I do now call myself an engineer as well. The experience of writing this post and reading its rebuttals taught me an important lesson about not drawing conclusions from an imagined experience that some unfamiliar category of person — in this case, civil engineers — might have.

I am not an engineer.

I am a computer programmer. I am a software developer. I am a software author. I am a coder.

I program computers. I develop software. I write software. I code.

I’d prefer that you not refer to me as an engineer, but this is not an essay about how I’m going to heap scorn upon you if you do so. Sometimes, I myself slip and use the word “engineering” to refer to this activity that I perform. Sometimes I use the word “engineer” to refer to myself or my peers. It is, sadly, fairly conventional to refer to us as “engineers”, and avoiding this term in a context where it’s what everyone else uses is a constant challenge.

Nevertheless, I do not “engineer” software. Neither do you, because nobody has ever known enough about the process of creating software to “engineer” it.

According to dictionary.com, “engineering” is:

“the art or science of making practical application of the knowledge of pure sciences, as physics or chemistry, as in the construction of engines, bridges, buildings, mines, ships, and chemical plants.”

When writing software, we typically do not apply “knowledge of pure sciences”. Very little science is germane to the practical creation of software, and the places where it is relevant (firmware for hard disks, for example, or analytics for physical sensors) are highly rarified. The one thing that we might sometimes use called “science”, i.e. computer science, is a subdiscipline of mathematics, and not a science at all. Even computer science, though, is hardly ever brought to bear - if you’re a working programmer, what was the last project where you had to submit formal algorithmic analysis for any component of your system?

Wikipedia has a heaping helping of criticism of the terminology behind software engineering, but rather than focusing on that, let's see where Wikipedia tells us software engineering comes from in the first place:

The discipline of software engineering was created to address poor quality of software, get projects exceeding time and budget under control, and ensure that software is built systematically, rigorously, measurably, on time, on budget, and within specification. Engineering already addresses all these issues, hence the same principles used in engineering can be applied to software.

Most software projects fail; as of 2009, 44% are late, over budget, or out of specification, and an additional 24% are cancelled entirely. Only a third of projects succeed according to those criteria of being under budget, within specification, and complete.

What would that look like if another engineering discipline had that sort of hit rate? Consider civil engineering. Would you want to live in a city where almost a quarter of all the buildings were simply abandoned half-constructed, or fell down during construction? Where almost half of the buildings were missing floors, had rents in the millions of dollars, or both?

My point is not that the software industry is awful. It certainly can be, at times, but it’s not nearly as grim as the metaphor of civil engineering might suggest. Consider this: despite the statistics above, is using a computer today really like wandering through a crumbling city where a collapsing building might kill you at any moment? No! The social and economic costs of these “failures” is far lower than most process consultants would have you believe. In fact, the cause of many such “failures” is a clumsy, ham-fisted attempt to apply engineering-style budgetary and schedule constraints to a process that looks nothing whatsoever like engineering. I have to use scare quotes around “failure” because many of these projects classified as failed have actually delivered significant value. For example, if the initial specification for a project is overambitious due to lack of information about the difficulty of the tasks involved, for example – an extremely common problem at the beginning of a software project – that would still be a failure according to the metric of “within specification”, but it’s a problem with the specification and not the software.

Certain missteps notwithstanding, most of the progress in software development process improvement in the last couple of decades has been in acknowledging that it can’t really be planned very far in advance. Software vendors now have to constantly present works in progress to their customers, because the longer they go without doing that there is an increasing risk that the software will not meet the somewhat arbitrary goals for being “finished”, and may never be presented to customers at all.

The idea that we should not call ourselves “engineers” is not a new one. It is a minority view, but I’m in good company in that minority. Edsger W. Dijkstra points out that software presents what he calls “radical novelty” - it is too different from all the other types of things that have come before to try to construct it by analogy to those things.

One of the ways in which writing software is different from engineering is the matter of raw materials. Skyscrapers and bridges are made of steel and concrete, but software is made out of feelings. Physical construction projects can be made predictable because the part where creative people are creating the designs - the part of that process most analagous to software - is a small fraction of the time required to create the artifact itself.

Therefore, in order to create software you have to have an “engineering” process that puts its focus primarily upon the psychological issue of making your raw materials - the brains inside the human beings you have acquired for the purpose of software manufacturing - happy, so that they may be efficiently utilized. This is not a common feature of other engineering disciplines.

The process of managing the author’s feelings is a lot more like what an editor does when “constructing” a novel than what a foreperson does when constructing a bridge. In my mind, that is what we should be studying, and modeling, when trying to construct large and complex software systems.

Consequently, not only am I not an engineer, I do not aspire to be an engineer, either. I do not think that it is worthwhile to aspire to the standards of another entirely disparate profession.

This doesn’t mean we shouldn’t measure things, or have quality standards, or try to agree on best practices. We should, by all means, have these things, but we authors of software should construct them in ways that make sense for the specific details of the software development process.

While we are on the subject of things that we are not, I’m also not a maker. I don’t make things. We don’t talk about “building” novels, or “constructing” music, nor should we talk about “building” and “assembling” software. I like software specifically because of all the ways in which it is not like “making” stuff. Making stuff is messy, and hard, and involves making lots of mistakes.

I love how software is ethereal, and mistakes are cheap and reversible, and I don’t have any desire to make it more physical and permanent. When I hear other developers use this language to talk about software, it makes me think that they envy something about physical stuff, and wish that they were doing some kind of construction or factory-design project instead of making an application.

The way we use language affects the way we think. When we use terms like “engineer” and “builder” to describe ourselves as creators, developers, maintainers, and writers of software, we are defining our role by analogy and in reference to other, dissimilar fields.

Right now, I think I prefer the term “developer”, since the verb develop captures both the incremental creation and ongoing maintenance of software, which is so much a part of any long-term work in the field. The only disadvantage of this term seems to be that people occasionally think I do something with apartment buildings, so I am careful to always put the word “software” first.

If you work on software, whichever particular phrasing you prefer, pick one that really calls to mind what software means to you, and don’t get stuck in a tedious metaphor about building bridges or cars or factories or whatever.

To paraphrase a wise man:

I am developer, and so can you.