Let Me Tell You A Secret

In which I provide you with hundreds of dollars worth of software consulting, for free, in a single blog post.

I do consulting¹ on software architecture, network protocol development, python software infrastructure, streamlined cloud deployment, and open source strategy, among other nerdy things. I enjoy solving challenging, complex technical problems or contributing to the open source commons. On the best jobs, I get to do both.

Today I would like to share with you a secret of the software technology consulting trade.

I should note that this secret is not specific to me. I have several colleagues who have also done software consulting and have reflected versions of this experience back to me.²

We’ll get to the secret itself in a moment, but first, some background.

Companies do not go looking for consulting when things are going great. This is particularly true when looking for high-level consulting on things like system architecture or strategy. Almost by definition, there’s a problem that I have been brought in to solve. Ideally, that problem is a technical challenge.

In the software industry, your team probably already has some software professionals with a variety of technical skills, and thus they know what to do with technical challenges. Which means that, as often as not, the problem is to do with people rather than technology, even it appears otherwise.

When you hire a staff-level professional like myself to address your software team’s general problems, that consultant will need to gather some information. If I am that consultant and I start to suspect that the purported technology problem that you’ve got is in fact a people problem, here is the secret technique that I am going to use:

I am going to go get a pen and a pad of paper, then schedule a 90-minute meeting with the most senior IC³ engineer that you have on your team. I will bring that pen and paper to the meeting. I will then ask one question:

What is fucked up about this place?

I will then write down their response in as much detail as I can manage. If I have begun to suspect that this meeting is necessary, 90 minutes is typically not enough time, and I will struggle to keep up. Even so, I will usually manage to capture the highlights.

One week later, I will schedule a meeting with executive leadership, and during that meeting, I will read back a very lightly edited⁴ version of the transcript of the previous meeting. This is then routinely praised as a keen strategic insight.

I should pause here to explicitly note that — obviously, I hope — this is not an oblique reference to any current or even recent client; if I’d had this meeting recently it would be pretty awkward to answer that “so, I read your blog…” email.⁵ But talking about clients in this way, no matter how obfuscated and vague the description, is always a bit professionally risky. So why risk it?

The thing is, I’m not a people manager. While I can do this kind of work, and I do not begrudge doing it if it is the thing that needs doing, I find it stressful and unfulfilling. I am a technology guy, not a people person. This is generally true of people who elect to go into technology consulting; we know where the management track is, and we didn’t pick it.

If you are going to hire me for my highly specialized technical expertise, I want you to get the maximum value out of it. I know my value; my rates are not low, and I do not want clients to come away with the sense that I only had a couple of “obvious” meetings.

So the intended audience for this piece is potential clients, leaders of teams (or organizations, or companies) who have a general technology problem and are wondering if they need a consultant with my skill-set to help them fix it. Before you decide that your issue is the need to implement a complex distributed system consensus algorithm, check if that is really what’s at issue. Talk to your ICs, and — taking care to make sure they understand that you want honest feedback and that they are safe to offer it — ask them what problems your organization has.

During this meeting it is important to only listen. Especially if you’re at a small company and you are regularly involved in the day-to-day operations, you might feel immediately defensive. Sit with that feeling, and process it later. Don’t unload your emotional state on an employee you have power over.⁶

“Only listening” doesn’t exclusively mean “don’t push back”. You also shouldn’t be committing to fixing anything. While the information you are gathering in these meetings is extremely valuable, and you should probably act on more of it than you will initially want to, your ICs won’t have the full picture. They really may not understand why certain priorities are set the way they are. You’ll need to take that as feedback for improving internal comms rather than “fixing” the perceived problem, and you certainly don’t want to make empty promises.

If you have these conversations directly, you can get something from it that no consultant can offer you: credibility. If you can actively listen, the conversation alone can improve morale. People like having their concerns heard. If, better still, you manage to make meaningful changes to address the concerns you’ve heard about, you can inspire true respect.

As a consultant, I’m going to be seen as some random guy wasting their time with a meeting. Even if you make the changes I recommend, it won’t resonate the same way as someone remembering that they personally told you what was wrong, and you took it seriously and fixed it.

Once you know what the problems are with your organization, and you’ve got solid technical understanding that you really do need that event-driven distributed systems consensus algorithm implemented using Twisted, I’m absolutely your guy. Feel free to get in touch.

While I immensely value my patrons support and am eternally grateful for their support, at — as of this writing — less than $100 per month it doesn’t exactly pay the SF bay area cost-of-living bill. ↩
When I reached out for feedback on a draft of this essay, every other consultant I showed it to said that something similar had happened to them within the last month, all with different clients in different sectors of the industry. I really cannot stress how common it is. ↩
“individual contributor”, if this bit of jargon isn’t universal in your corner of the world; i.e.: “not a manager”. ↩
Mostly, I need to remove a bunch of profanity, but sometimes I will also need to have another interview, usually with a more junior person on the team to confirm that I’m not relaying only a single person’s perspective. It is pretty rare that the top-of-mind problems are specific to one individual, though. ↩
To the extent that this is about anything saliently recent, I am perhaps grumbling about how tech CEOs aren’t taking morale problems generated by the constant drumbeat of layoffs seriously enough. ↩
I am not always in the role of a consultant. At various points in my career, I have also been a leader needing to sit in this particular chair, and believe me, I know it sucks. This would not be a common problem if there weren’t a common reason that leaders tend to avoid this kind of meeting. ↩

Safer, Not Later

How “Move Fast and Break Things” ruined the world by escaping the context that it was intended for.

programming engineering politics management supported Wednesday December 06, 2023

Facebook — and by extension, most of Silicon Valley — rightly gets a lot of shit for its old motto, “Move Fast and Break Things”.

As a general principle for living your life, it is obviously terrible advice, and it leads to a lot of the horrific outcomes of Facebook’s business.

I don’t want to be an apologist for Facebook. I also do not want to excuse the worldview that leads to those kinds of outcomes. However, I do want to try to help laypeople understand how software engineers—particularly those situated at the point in history where this motto became popular—actually meant by it. I would like more people in the general public to understand why, to engineers, it was supposed to mean roughly the same thing as Facebook’s newer, goofier-sounding “Move fast with stable infrastructure”.

Move Slow

In the bad old days, circa 2005, two worlds within the software industry were colliding.

The old world was the world of integrated hardware/software companies, like IBM and Apple, and shrink-wrapped software companies like Microsoft and WordPerfect. The new world was software-as-a-service companies like Google, and, yes, Facebook.

In the old world, you delivered software in a physical, shrink-wrapped box, on a yearly release cycle. If you were really aggressive you might ship updates as often as quarterly, but faster than that and your physical shipping infrastructure would not be able to keep pace with new versions. As such, development could proceed in long phases based on those schedules.

In practice what this meant was that in the old world, when development began on a new version, programmers would go absolutely wild adding incredibly buggy, experimental code to see what sorts of things might be possible in a new version, then slowly transition to less coding and more testing, eventually settling into a testing and bug-fixing mode in the last few months before the release.

This is where the idea of “alpha” (development testing) and “beta” (user testing) versions came from. Software in that initial surge of unstable development was extremely likely to malfunction or even crash. Everyone understood that. How could it be otherwise? In an alpha test, the engineers hadn’t even started bug-fixing yet!

In the new world, the idea of a 6-month-long “beta test” was incoherent. If your software was a website, you shipped it to users every time they hit “refresh”. The software was running 24/7, on hardware that you controlled. You could be adding features at every minute of every day. And, now that this was possible, you needed to be adding those features, or your users would get bored and leave for your competitors, who would do it.

But this came along with a new attitude towards quality and reliability. If you needed to ship a feature within 24 hours, you couldn’t write a buggy version that crashed all the time, see how your carefully-selected group of users used it, collect crash reports, fix all the bugs, have a feature-freeze and do nothing but fix bugs for a few months. You needed to be able to ship a stable version of your software on Monday and then have another stable version on Tuesday.

To support this novel sort of development workflow, the industry developed new technologies. I am tempted to tell you about them all. Unit testing, continuous integration servers, error telemetry, system monitoring dashboards, feature flags... this is where a lot of my personal expertise lies. I was very much on the front lines of the “new world” in this conflict, trying to move companies to shorter and shorter development cycles, and move away from the legacy worldview of Big Release Day engineering.

Old habits die hard, though. Most engineers at this point were trained in a world where they had months of continuous quality assurance processes after writing their first rough draft. Such engineers feel understandably nervous about being required to ship their probably-buggy code to paying customers every day. So they would try to slow things down.

Of course, when one is deploying all the time, all other things being equal, it’s easy to ship a show-stopping bug to customers. Organizations would do this, and they’d get burned. And when they’d get burned, they would introduce Processes to slow things down. Some of these would look like:

Let’s keep a special version of our code set aside for testing, and then we’ll test that for a few weeks before sending it to users.
The heads of every department need to sign-off on every deployed version, so everyone needs to spend a day writing up an explanation of their changes.
QA should sign off too, so let’s have an extensive sign-off process where each individual tester does a fills out a sign-off form.

Then there’s my favorite version of this pattern, where management decides that deploys are inherently dangerous, and everyone should probably just stop doing them. It typically proceeds in stages:

Let’s have a deploy freeze, and not deploy on Fridays; don’t want to mess up the weekend debugging an outage.
Actually, let’s extend that freeze for all of December, we don’t want to mess up the holiday shopping season.
Actually why not have the freeze extend into the end of November? Don’t want to mess with Thanksgiving and the Black Friday weekend.
Some of our customers are in India, and Diwali’s also a big deal. Why not extend the freeze from the end of October?
But, come to think of it, we do a fair amount of seasonal sales for Halloween too. How about no deployments from October 10 onward?
You know what, sometimes people like to use our shop for Valentine’s day too. Let’s just never deploy again.

This same anti-pattern can repeat itself with an endlessly proliferating list of “environments”, whose main role ends up being to ensure that no code ever makes it to actual users.

… and break things anyway

As you may have begun to suspect, there are a few problems with this style of software development.

Even back in the bad old days of the 90s when you had to ship disks in boxes, this methodology contained within itself the seeds of its own destruction. As Joel Spolsky memorably put it, Microsoft discovered that this idea that you could introduce a ton of bugs and then just fix them later came along with some massive disadvantages:

The very first version of Microsoft Word for Windows was considered a “death march” project. It took forever. It kept slipping. The whole team was working ridiculous hours, the project was delayed again, and again, and again, and the stress was incredible. [...] The story goes that one programmer, who had to write the code to calculate the height of a line of text, simply wrote “return 12;” and waited for the bug report to come in [...]. The schedule was merely a checklist of features waiting to be turned into bugs. In the post-mortem, this was referred to as “infinite defects methodology”.

Which lead them to what is perhaps the most ironclad law of software engineering:

In general, the longer you wait before fixing a bug, the costlier (in time and money) it is to fix.

A corollary to this is that the longer you wait to discover a bug, the costlier it is to fix.

Some bugs can be found by code review. So you should do code review. Some bugs can be found by automated tests. So you should do automated testing. Some bugs will be found by monitoring dashboards, so you should have monitoring dashboards.

So why not move fast?

But here is where Facebook’s old motto comes in to play. All of those principles above are true, but here are two more things that are true:

No matter how much code review, automated testing, and monitoring you have some bugs can only be found by users interacting with your software.
No bugs can be found merely by slowing down and putting the deploy off another day.

Once you have made the process of releasing software to users sufficiently safe that the potential damage of any given deployment can be reliably limited, it is always best to release your changes to users as quickly as possible.

More importantly, as an engineer, you will naturally have an inherent fear of breaking things. If you make no changes, you cannot be blamed for whatever goes wrong. Particularly if you grew up in the Old World, there is an ever-present temptation to slow down, to avoid shipping, to hold back your changes, just in case.

You will want to move slow, to avoid breaking things. Better to do nothing, to be useless, than to do harm.

For all its faults as an organization, Facebook did, and does, have some excellent infrastructure to avoid breaking their software systems in response to features being deployed to production. In that sense, they’d already done the work to avoid the “harm” of an individual engineer’s changes. If future work needed to be performed to increase safety, then that work should be done by the infrastructure team to make things safer, not by every other engineer slowing down.

The problem is that slowing down is not actually value neutral. To quote myself here:

If you can’t ship a feature, you can’t fix a bug.

When you slow down just for the sake of slowing down, you create more problems.

The first problem that you create is smashing together far too many changes at once.

You’ve got a development team. Every engineer on that team is adding features at some rate. You want them to be doing that work. Necessarily, they’re all integrating them into the codebase to be deployed whenever the next deployment happens.

If a problem occurs with one of those changes, and you want to quickly know which change caused that problem, ideally you want to compare two versions of the software with the smallest number of changes possible between them. Ideally, every individual change would be released on its own, so you can see differences in behavior between versions which contain one change each, not a gigantic avalanche of changes where any one of hundred different features might be the culprit.

If you slow down for the sake of slowing down, you also create a process that cannot respond to failures of the existing code.

I’ve been writing thus far as if a system in a steady state is inherently fine, and each change carries the possibility of benefit but also the risk of failure. This is not always true. Changes don’t just occur in your software. They can happen in the world as well, and your software needs to be able to respond to them.

Back to that holiday shopping season example from earlier: if your deploy freeze prevents all deployments during the holiday season to prevent breakages, what happens when your small but growing e-commerce site encounters a catastrophic bug that has always been there, but only occurs when you have more than 10,000 concurrent users. The breakage is coming from new, never before seen levels of traffic. The breakage is coming from your success, not your code. You’d better be able to ship a fix for that bug real fast, because your only other option to a fast turn-around bug-fix is shutting down the site entirely.

And if you see this failure for the first time on Black Friday, that is not the moment where you want to suddenly develop a new process for deploying on Friday. The only way to ensure that shipping that fix is easy is to ensure that shipping any fix is easy. That it’s a thing your whole team does quickly, all the time.

The motto “Move Fast And Break Things” caught on with a lot of the rest of Silicon Valley because we are all familiar with this toxic, paralyzing fear.

After we have the safety mechanisms in place to make changes as safe as they can be, we just need to push through it, and accept that things might break, but that’s OK.

Some Important Words are Missing

The motto has an implicit preamble, “Once you have done the work to make broken things safe enough, then you should move fast and break things”.

When you are in a conflict about whether to “go fast” or “go slow”, the motto is not supposed to be telling you that the answer is an unqualified “GOTTA GO FAST”. Rather, it is an exhortation to take a beat and to go through a process of interrogating your motivation for slowing down. There are three possible things that a person saying “slow down” could mean about making a change:

It is broken in a way you already understand. If this is the problem, then you should not make the change, because you know it’s not ready. If you already know it’s broken, then the change simply isn’t done. Finish the work, and ship it to users when it’s finished.
It is risky in a way that you don’t have a way to defend against. As far as you know, the change works, but there’s a risk embedded in it that you don’t have any safety tools to deal with. If this is the issue, then what you should do is pause working on this change, and build the safety first.
It is making you nervous in a way you can’t articulate. If you can’t describe an known defect as in point 1, and you can’t outline an improved safety control as in step 2, then this is the time to let go, accept that you might break something, and move fast.

The implied context for “move fast and break things” is only in that third condition. If you’ve already built all the infrastructure that you can think of to build, and you’ve already fixed all the bugs in the change that you need to fix, any further delay will not serve you, do not have any further delays.

Unfortunately, as you probably already know,

This motto did a lot of good in its appropriate context, at its appropriate time. It’s still a useful heuristic for engineers, if the appropriate context is generally understood within the conversation where it is used.

However, it has clearly been taken to mean a lot of significantly more damaging things.

Purely from an engineering perspective, it has been reasonably successful. It’s less and less common to see people in the industry pushing back against tight deployment cycles. It’s also less common to see the basic safety mechanisms (version control, continuous integration, unit testing) get ignored. And many ex-Facebook engineers have used this motto very clearly under the understanding I’ve described here.

Even in the narrow domain of software engineering it is misused. I’ve seen it used to argue a project didn’t need tests; that a deploy could be forced through a safety process; that users did not need to be informed of a change that could potentially impact them personally.

Outside that domain, it’s far worse. It’s generally understood to mean that no safety mechanisms are required at all, that any change a software company wants to make is inherently justified because it’s OK to “move fast”. You can see this interpretation in the way that it has leaked out of Facebook’s engineering culture and suffused its entire management strategy, blundering through market after market and issue after issue, making catastrophic mistakes, making a perfunctory apology and moving on to the next massive harm.

In the decade since it has been retired as Facebook’s official motto, it has been used to defend some truly horrific abuses within the tech industry. You only need to visit the orange website to see it still being used this way.

Even at its best, “move fast and break things” is an engineering heuristic, it is not an ethical principle. Even within the context I’ve described, it’s only okay to move fast and break things. It is never okay to move fast and harm people.

So, while I do think that it is broadly misunderstood by the public, it’s still not a thing I’d ever say again. Instead, I propose this:

Make it safer, don’t make it later.

Acknowledgments

Thank you to my patrons who are supporting my writing on this blog. If you like what you’ve read here and you’d like to read more of it, or you’d like to support my various open-source endeavors, you can support my work as a sponsor! I am also available for consulting work if you think your organization could benefit from expertise on topics like “how do I make changes to my codebase, but, like, good ones”.

Bilithification

Not sure how to do microservices? Split your monolith in half.

programming architecture microservices management supported Thursday July 20, 2023

Several years ago at O’Reilly’s Software Architecture conference, within a comprehensive talk on refactoring “Technical Debt: A Masterclass”, r0ml¹ presented a concept that I think should be highlighted.

If you have access to O’Reilly Safari, I think the video is available there, or you can get the slides here. It’s well worth watching in its own right. The talk contains a lot of hard-won wisdom from a decades-long career, but in slides 75-87, he articulates a concept that I believe resolves the perennial pendulum-swing between microservices and monoliths that we see in the Software as a Service world.

I will refer to this concept as “the bilithification strategy”.

Background

Personally, I have long been a microservice skeptic. I would generally articulate this skepticism in terms of “YAGNI”.

Here’s the way I would advise people asking about microservices before encountering this concept:

Microservices are often adopted by small teams due to their advertised benefits. Advocates from very large organizations—ones that have been very successful with microservices—frequently give talks claiming that microservices are more modular, more scalable, and more fault-tolerant than their monolithic progenitors. But these teams rarely appreciate the costs, particularly the costs for smaller orgs. Specifically, there is a fixed operational marginal cost to each new service, and a fairly large fixed operational overhead to the infrastructure for an organization deploying microservices in at all.

With a large enough team, the operational cost is easy to absorb. As the overhead is fixed, it trends towards zero as your total team size and system complexity trend towards infinity. Also, in very large teams, the enforced isolation of components in separate services reduces complexity. It does so specifically intentionally causing the software architecture to mirror the organizational structure of the team that deploys it. This — at the cost of increased operational overhead and decreased efficiency — allows independent parts of the organization to make progress independently, without blocking on each other. Therefore, in smaller teams, as you’re building, you should bias towards building a monolith until the complexity costs of the monolith become apparent. Then you should build the infrastructure to switch to microservices.

I still stand by all of this. However, it’s incomplete.

What does it mean to “switch to microservices”?

The biggest thing that this advice leaves out is a clear understanding of the “micro” in “microservice”. In this framing, I’m implicitly understanding “micro” services to be services that are too small — or at least, too small for your team. But if they do work for large organizations, then at some point, you need to have them. This leaves open several questions:

What size is the right size for a service?
When should you split your monolith up into smaller services?
Wait, how do you even measure “size” of a service? Lines of code? Gigabytes of memory? Number of team members?

In a specific situation I could probably look at these questions for that situation, and make suggestions as to the appropriate course of action, but that’s based largely on vibes. There’s just a lot of drawing on complex experiences, not a repeatable pattern that a team could apply on their own.

We can be clear that you should always start with a monolith. But what should you do when that’s no longer working? How do you even tell when it’s no longer working?

Bilithification

Every codebase begins as a monolith. That is a single (mono) rock (lith). Here’s what it looks like.

a circle with the word “monolith” on it

Let’s say that the monolith, and the attendant team, is getting big enough that we’re beginning to consider microservices. We might now ask, “what is the appropriate number of services to split the monolith into?” and that could provoke endless debate even among a team with total consensus that it might need to be split into some number of services.

Rather than beginning with the premise that there is a correct number, we may observe instead that splitting the service into N services where N is more than one may be accomplished splitting the service in half N-1 times.

So let’s bi (two) lithify (rock) this monolith, and take it from 1 to 2 rocks.

The task of splitting the service into two parts ought to be a manageable amount of work — two is a definitively finite number, as composed to the infinite point-cloud of “microservices”. Thus, we should search, first, for a single logical seam along which we might cleave the monolith.

a circle with the word “monolith” on it and a line through it

In many cases—as in the specific case that r0ml gave—the easiest way to articulate a boundary between two parts of a piece of software is to conceptualize a “frontend” and a “backend”. In the absence of any other clear boundary, the question “does this functionality belong in the front end or the back end” can serve as a simple razor for separating the service.

Remember: the reason we’re splitting this software up is because we are also splitting the team up. You need to think about this in terms of people as well as in terms of functionality. What division between halves would most reduce the number of lines of communication, to reduce the quadratic increase in required communication relationships that comes along with the linear increase in team size? Can you identify two groups who need to talk amongst themselves, but do not need to talk with all of each other?²

two circles with the word “hemilith” on them and a double-headed arrow
between them

Once you’ve achieved this separation, we no longer have a single rock, we have two half-rocks: hemiliths to borrow from the same Greek root that gave us “monolith”.

But we are not finished, of course. Two may not be the correct number of services to end up with. Now, we ask: can we split the frontend into a frontend and backend? Can we split the backend? If so, then we now have four rocks in place of our original one:

four circles with the word “tetartolith” on them and double-headed arrows
connecting them all

You might think that this would be a “tetralith” for “four”, but as they are of a set, they are more properly a tetartolith.

Repeat As Necessary

At some point, you’ll hit a point where you’re looking at a service and asking “what are the two pieces I could split this service into?”, and the answer will be “none, it makes sense as a single piece”. At that point, you will know that you’ve achieved services of the correct size.

One thing about this insight that may disappoint some engineers is the realization that service-oriented architecture is much more an engineering management tool than it is an engineering tool. It’s fun to think that “microservices” will let you play around with weird technologies and niche programming languages consequence-free at your day job because those can all be “separate services”, but that was always a fantasy. Integrating multiple technologies is expensive, and introducing more moving parts always introduces more failure points.

Advanced Techniques: A Multi-Stack Microservice Environment

You’ll note that splitting a service heavily implies that the resulting services will still all be in the same programming language and the same tech stack as before. If you’re interested in deploying multiple stacks (languages, frameworks, libraries), you can proceed to that outcome via bilithification, but it is a multi-step process.

First, you have to complete the strategy that I’ve already outlined above. You need to get to a service that is sufficiently granular that it is atomic; you don’t want to split it up any further.

Let’s call that service “X”.

Second, you identify the additional complexity that would be introduced by using a different tech stack. It’s important to be realistic here! New technology always seems fun, and if you’re investigating this, you’re probably predisposed to think it would be an improvement. So identify your costs first and make sure you have them enumerated before you move on to the benefits.

Third, identify the concrete benefits to X’s problem domain that the new tech stack would provide.

Finally, do a cost-benefit analysis where you make sure that the costs from step 2 are clearly exceeded by the benefits from step three. If you can’t readily identify that in advance – sometimes experimentation is required — then you need to treat this as an experiment, rather than as a strategic direction, until you’ve had a chance to answer whatever questions you have about the new technology’s benefits benefits.

Note, also, that this cost-benefit analysis requires not only doing the technical analysis but getting buy-in from the entire team tasked with maintaining that component.

Conclusion

To summarize:

Always start with a monolith.
When the monolith is too big, both in terms of team and of codebase, split the monolith in half until it doesn’t make sense to split it in half any more.
(Optional) Carefully evaluate services that want to adopt new technologies, and keep the costs of doing that in mind.

There is, of course, a world of complexity beyond this associated with managing the cost of a service-oriented architecture and solving specific technical problems that arise from that architecture.

If you remember the tetartolith, though, you should at least be able to get to the right number and size of services for your system.

Thank you to my patrons who are supporting my writing on this blog. If you like what you’ve read here and you’d like to read more of it, or you’d like to support my various open-source endeavors, you can support my work as a sponsor! I am also available for consulting work if you think your organization could benefit from more specificity on the sort of insight you've seen here.

AKA “my father” ↩
Denouncing “silos” within organizations is so common that it’s a tired trope at this point. There is no shortage of vaguely inspirational articles across the business trade-rag web and on LinkedIn exhorting us to “break down silos”, but silos are the point of having an organization. If everybody needs to talk to everybody else in your entire organization, if no silos exist to prevent the necessity of that communication, then you are definitionally operating that organization at minimal efficiency. What people actually want when they talk about “breaking down silos” is a re-org into a functional hierarchy rather than a role-oriented hierarchy (i.e., “this is the team that makes and markets the Foo product, this is the team that makes and markets the Bar product” as opposed to “this is the sales team”, “this is the engineering team”). But that’s a separate post, probably. ↩

A Response to Jacob Kaplan-Moss’s “Incompetent But Nice”

What can managers do about employees who are easy to work with, and are trying their best, but can’t seem to get the job done?

management Wednesday March 29, 2023

Jacob Kaplan-Moss has written a post about one of the most stressful things that can happen to you as a manager: when someone on your team is getting along well with the team, apparently trying their best, but unable to do the work. Incompetent, but nice.

I have some relevant experience with this issue. On more than one occasion in my career:

I have been this person, more than once. I have both resigned, and been fired, as a result.
I’ve been this person’s manager, and had to fire them after being unable to come up with a training plan that would allow them to improve.
I’ve been an individual contributor on a team with this person, trying to help them improve.

So I can speak to this issue from all angles, and I can confirm: it is gut-wrenchingly awful no matter where you are in relation to it. It is social stress in its purest form. Everyone I’ve been on some side of this dynamic with is someone that I’d love to work with again. Including the managers who fired me!

Perhaps most of all, since I am not on either side of an employer/employee relationship right now¹, I have some emotional distance from this kind of stress which allows me to write about it in a more detached way than others with more recent and proximate experience.

As such, I’d like to share some advice, in the unfortunate event that you find yourself in one of these positions and are trying to figure out what to do.

I’m going to speak from the perspective of the manager here, because that’s where most of the responsibility for decision-making lies. If you’re a teammate you probably need to wait for a manager to ask you to do something, if you’re the underperformer here you’re probably already trying as hard as you can to improve. But hopefully this reasoning process can help you understand what the manager is trying to do here, and find the bits of the process you can take some initiative to help with.

Step 0: Preliminaries

First let’s lay some ground rules.

Breathe. Maintaining an explicit focus on explicitly regulating your own mood is important, regardless of whether you’re the manager, teammate, or the underperformer.
Accept that this may be intractable. You’re going to do your best in this situation but you are probably choosing between bad options. Nevertheless you will need to make decisions as confidently and quickly as possible. Letting this situation drag on can be a recipe for misery.
You will need to do a retrospective.² Get ready to collect information as you go through the process to try to identify what happened in detail so you can analyze it later. If you are the hiring manager, that means that after you’ve got your self-compassion together and your equanimous professional mood locked in, you will also need to reflect on the fact that you probably fucked up here, and get ready to try to improve your own skills and processes so that you don’t end up this situation again.

I’m going to try to pick up where Jacob left off, and skip over the “easy” parts of this process. As he puts it:

The proximate answer is fairly easy: you try to help them level up: pay for classes, conferences, and/or books; connect them with mentors or coaches; figure out if something’s in the way and remove the blocker. Sometimes folks in this category just need to learn, and because they’re nice it’s easy to give them a lot of support and runway to level up. Sometimes these are folks with things going on in their lives outside work and they need some time (or some leave) to focus on stuff that’s more important than work. Sometimes the job has requirements that can be shifted or eased or dropped – you can match the work to what the person’s good at. These situations aren’t always easy but they are simple: figure out the problem and make a change.

Step 1: Figuring Out What’s Going On

There are different reasons why someone might underperform.

Possibility: The person is over-leveled.

This is rare. For the most part, pervasive under-leveling is more of a problem in the industry. But, it does happen, and when it happens, what it looks like is someone who is capable of doing some of the work that they’re assigned, but getting stuck and panicking with more challenging or ambiguous assignments.

Moreover, this particular organizational antipattern, the “nice but incompetent” person, is more likely to be over-leveled, because if they’re friendly and getting along with the team, then it’s likely they made a good first impression and made the hiring committee want to go to bat for them as well, and may have done them the “favor” of giving them a higher level. This is something to consider in the retrospective.

Now, Jacob’s construction of this situation explicitly allows for “leveling up”, but the sort of over-leveling that can be resolved with a couple of conference talks, books, and mentoring sessions is just a challenge. We’re talking about persistent resistance to change here, and that sort of over-leveling means that they have skipped an entire rung on the professional development ladder, and do not have the professional experience at this point to bridge the gaps between their experience and the ladder.

If this is the case, consider a demotion. Identify the aspects of the work that the underperformer can handle, and try to match them to a role. If you are yourself the underperformer, proactively identifying what you’re actually doing well at and acknowledging the work you can’t handle, and identifying any other open headcount for roles at that level can really make this process easier for your manager.

However, be sure to be realistic. Are they capable enough to work at that reduced level? Does your team really have a role for someone at that level? Don’t invent makework in the hopes that you can give them a bootleg undergraduate degree’s worth of training on the job; they need to be able to contribute.

Possibility: Undiagnosed health issues

Jacob already addressed the “easy” version here: someone is struggling with an issue that they know about and they can ask for help, or at least you can infer the help they need from the things they’ve explicitly said.

But the underperformer might have something going on which they don’t realize is an issue. Or they might know there’s an issue but not think it’s serious, or not be able to find a diagnosis or treatment. Most frequently, this is a mental health problem, but it can also be things like unexplained fatigue.

This possibility is the worst. Not only do you feel like a monster for adding insult to injury, there’s also a lot of risk in discussing it.

Sometimes, you feel like you can just tell³ that somebody is suffering from a particular malady. It may seem obvious to you. If you have any empathy, you probably want to help them. However, you have to be careful.

First, some illnesses qualify as disabilities. Just because the employee has not disclosed their disability to you, does not mean they are unaware. It is up to the employee whether to tell you, and you are not allowed to require them to disclose anything to you. They may have good reasons for not talking about it.

Beyond highlighting some relevant government policy I am not equipped on how to advise you on how to handle this. You probably want to loop in someone from HR and/or someone from Legal, and have a meeting to discuss the particulars of what’s happening and how you’d like to try to help.

Second, there’s a big power differential here. You have to think hard about how to broach the subject; you’re their boss, telling them that you think they’re too sick to work. In this situation, they explicitly don’t necessarily agree, and that can quite reasonably be perceived as an attack and an insult, even if you’re correct. Hopefully the folks in Legal or HR can help you with some strategies here; again, I’m not really qualified to do anything but point at the risks involved and say “oh no”.

The “good” news here is that if this really is the cause, then there’s not a whole lot to change in your retrospective. People get sick, their families get sick, it can’t always be predicted or prevented.

Possibility: Burnout

While it is certainly a mental health issue in its own right, burnout is specifically occupational and you can thus be a bit more confident as a manager recognizing it in an employment context.

This is better to prevent than to address, but if you’ve got someone burning out badly enough to be a serious performance issue, you need to give them more leave than they think they need, and you need to do it quickly. A week or two off is not going to cut it.

In my experience, this is the most common cause of an earnest but agreeable employee underperforming, and it is also the one we are most reluctant to see. Each step on the road to burnout seems locally reasonable.

Just push yourself a little harder. Just ask for a little overtime. Just until this deadline; it’s really important. Just until we can hire someone; we’ve already got a req open for that role. Just for this one client.

It feels like we should be able to white-knuckle our way through “just” a little inconvenience. It feels that way both individually and collectively. But the impacts are serious, and they are cumulative.

There are two ways this can manifest.

Usually, it’s a gradual decline that you can see over time, and you’ll see this in an employee that was previously doing okay, but now can’t hack it.

However, another manifestation is someone who was burned out at their previous role, did not take any break between jobs, and has stepped into a moderately stressful role which could be a healthy level of challenge for someone refreshed and taking frequent enough breaks, but is too demanding for someone who needs to recover.

If that’s the case, and you feel like you accurately identified a promising candidate, it is really worthwhile to get that person the break that they need. Just based on vague back-of-the-envelope averages, it would typically be about as expensive to find a way to wrangle 8 weeks of extra leave than to go through the whole hiring process for a mid-career software engineer from scratch. However, that math assumes that the morale cost of firing someone is zero, and the morale benefit of being seen to actually care about your people and proactively treat them well as also zero.

If you can positively identify this as the issue, then you have a lot of work to do in the retrospective. People do not spontaneously burn out by themselves. this is a management problem and this person is likely to be the first domino to fall. You may need to make some pretty big changes across your team.

Possibility: Persistent Personality Conflict

It may be that someone else on the team is actually the problem. If the underperformer is inconsistent, observe the timing of the inconsistencies: does it get much worse when they’re assigned to work closely with someone else? Note that “personality conflict” does not mean that the other person is necessarily an asshole; it is possible for communication styles to simply fail to mesh due to personality differences which cannot be easily addressed.

You will be tempted to try to reshuffle responsibilities to keep these team members further apart from each other.

Don’t.

People with a conflict that is persistently interfering in day-to-day work need to be on different teams entirely. If you attempt to separate them but have them working closely, then inevitably one is going to get the higher-status projects and the other is going to be passed over for advancement. Or they’re going to end up drifting back into similar areas again.

Find a way to transfer them internally far enough away that they can have breathing room away from this conflict. If you can’t do that, then a firing may be your best option.

In the retrospective, it’s worth examining the entire team dynamic at this point, to see if the conflict is really just between two people, or if it’s more pervasive than that and other members of the team are just handling it better.

Step 2: Help By Being Kind, Not By Being Nice

Again, we’re already past the basics here. You’ve already got training and leave and such out of the way. You’re probably going to need to make a big change.

Responding to Jacob, specifically:

Firing them feels wrong; keeping them on feels wrong.

I think that which is right is heavily context-dependent. But, realistically, firing is the more likely option. You’ve got someone here who isn’t performing adequately, and you’ve already deployed all the normal tools to try to dig yourself out of that situation.

So let’s talk about that first.

Firing

Briefly, If you need to fire them, just fire them, and do it quickly.

Firing people sucks, even obnoxious people. Not to mention that this situation we’re in is about a person that you like! You’ll want to be nice to this person. The person is also almost certainly trying their best. It’s pretty hard to be agreeable to your team if you’re disappointing that team and not even trying to improve.

If you find yourself thinking “I probably need to fire this person but it’s going to be hard on them”, the thought “hard on them” indicates you are focused on trying to help them personally, and not what is best for your company, your team, or even the employee themselves professionally. The way to show kindness in that situation is not to keep them in a role that’s bad for them and for you.

It would be much better for the underperformer to find a role where they are not an underperformer, and at this point, that role is probably not on your team. Every minute that you keep them on in the bad role is a minute that they can’t spend finding a good one.

As such, you need to immediately shift gears towards finding them a soft landing that does not involve delaying whatever action you need to take.

Being kind is fine. It is not even a conflict to try to spend some company resources to show that kindness. It is in the best interest of all concerned that anyone you have to fire or let go is inclined to sing your praises wherever they end up. The best brand marketing in the world for your jobs page is a diaspora of employees who wish they could still be on your team.

But being nice here, being conflict-avoidant and agreeable, only drags out an unpleasant situation. The way to spend company resources on kindness is to negotiate with your management for as large a severance package as you can manage, and give as much runway as possible for their job search, and clarity about what else you can do.

For example, are you usable as a positive reference? I.e., did they ever have a period where their performance was good, which you could communicate to future employers? Be clear.

Not-Firing

But while I think it’s the more likely option, it’s certainly not the only option. There are certainly cases where underperformers really can be re-situated into better roles, and this person could still find a good fit within the team, or at least, the company. You think you’ve solved the mystery of the cause of the problem here, and you need to make a change. What then?

In that case, the next step is to have a serious conversation about performance management. Set expectations clearly. Ironically, if you’re dealing with a jerk, you’ve probably already crisply communicated your issues. But if you’re dealing with a nice person, you’re more likely to have a slow, painful drift into this awkward situation, where they probably vaguely know that they’re doing poorly but may not realize how poorly.

If that’s what’s happening, you need to immediately correct it, even if firing isn’t on the table. If you’ve gotten to this point, some significant action is probably necessary. Make sure they understand the urgency of the situation, and if you have multiple options for them to consider, give them a clear timeline for how long they have to make a decision.

As I detailed above, things like a down-leveling or extended leave might be on the table. You probably do not want to do anything like that in a normal 1x1: make sure you have enough time to deal with it.

Remember: Run That Retrospective!

In most cases where this sort of situation develops, is a clear management failure. If you’re the manager, you need to own it.

Try to determine where the failure was. Was it hiring? Leveling? A team culture that encourages burnout? A poorly-vetted internal transfer? Accepting low performance for too long, not communicating expectations?

If you can identify a systemic cause as actionable, then you need to make sure there is time and space to make necessary changes, and soon. It’s stressful to have to go through this process with one person, but if you have to do this repeatedly, any dynamic that can damage multiple people’s productivity persistently is almost definitionally a team-destroyer.

REMEMBER TO LIKE AND SUBSCRIBE ↩
I know it’s a habit we all have from industry jargon — heck, I used the term in my own initial Mastodon posts about this — but “postmortem” is a fraught term in the best of circumstances, and super not great when you’re talking about an actual person who has recently been fired. Try to stick to “retrospective” or “review” when you’re talking about this process with your team. ↩
Now, I don’t know if this is just me, but for reasons that are outside the scope of this post, when I was in my mid teens I got a copy of the DSM-IV, read the whole thing back to back, and studied it for a while. I’ve never had time to catch up to the DSM-5 but I’m vaguely aware of some of the changes and I’ve read a ton of nonfiction related to mental health. As a result of this self-education, I have an extremely good track record of telling people “you should see a psychiatrist about X”. I am cautious about this sort of thing and really only tell close friends, but as far as I know my hit-rate is 100%. ↩