One potato, two potato, three potato, four
Five potato, six potato, seven potato, more.
Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.
— Donald Knuth, “Structured Programming with go to Statements”, Computing Surveys, Vol. 6, No. 4, December 1974
Knuth’s admonition about premature optimization is such a cliché among software developers at this point that even the correction to include the full context of the quote is itself a cliché.
Still, it’s a cliché for a reason: the speed at which software can be written is in tension — if not necessarily in conflict — with the speed at which it executes. As Nelson Elhage has explained, software can be qualitatively worse when it is slow, but spending time optimizing an algorithm before getting any feedback from users or profiling the system as a whole can lead one down many blind alleys of wasted effort.
In that same essay, Nelson further elaborates that performant foundations simplify architecture1. He then follows up with several bits of architectural advice that are highly specific to parsing—compilers and type-checkers specifically—which, while good, are hard to generalize beyond “optimizing performance early can also be good”.
So, here I will endeavor to generalize that advice. How does one provide a performant architectural foundation without necessarily wasting a lot of time on early micro-optimization?
Enter The Potato
Many years before Nelson wrote his excellent aforementioned essay, my father coined a related term: “Potato Programming”.
In modern vernacular, a “potato” is very slow hardware, and “potato programming” is its software equivalent.
The term comes from the rhyme that opened this essay, and is meant to evoke a slow, childlike counting of individual elements as an algorithm operates upon them. It is an unfortunately quite common software-architectural idiom whereby interfaces are provided in terms of scalar values: in other words, APIs that require you to use for loops or other forms of explicit, individual, non-parallelized iteration. But this is all very abstract; an example might help.
For a generic business-logic example, let’s consider the problem of monthly recurring billing. Every month, we pull in the list of all subscriptions to our service, and we bill them.
Since our hypothetical company has an account-management team that owns the UI which updates subscriptions, and a billing backend team that writes code to interface with 3rd-party payment providers, we’ll create 2 backends, here represented by some Python classes with illustrative names. Finally, we’ll have an orchestration layer that puts them together to actually run the billing. I will use async to indicate which things require a network round trip. An interface in this style might look something like this:

```python
from dataclasses import dataclass


@dataclass
class Subscription:
    id: str
    customer_id: str
    amount_cents: int


class SubscriptionService:
    async def all_subscription_ids(self) -> list[str]:
        "Return the ID of every active subscription."
        ...

    async def get_subscription(self, subscription_id: str) -> Subscription:
        "Look up one subscription by its ID."
        ...


class BillingService:
    async def bill_customer(self, customer_id: str, amount_cents: int) -> None:
        "Charge one customer, via a 3rd-party payment provider."
        ...
```
To many readers, this may look like an entirely reasonable interface specification; indeed, it looks like a lot of real, public-facing “REST” APIs. An equally apparently-reasonable implementation of our orchestration between them might look something like this:

```python
async def bill_everyone(
    subscriptions: SubscriptionService, billing: BillingService
) -> None:
    # One potato, two potato: fetch and bill each subscription individually.
    for subscription_id in await subscriptions.all_subscription_ids():
        each = await subscriptions.get_subscription(subscription_id)
        await billing.bill_customer(each.customer_id, each.amount_cents)
```
This is, however, just about the slowest possible implementation of this functionality. So, this is the bad version. Let’s talk about the good version: no-tato programming, if you will. But first, some backstory.
My father began his career as an APL programmer, and one of the key insights he took away from APL’s architecture is that, as he puts it:
Computers like to do things over and over again. They like to do things on arrays. They don’t want to do things on scalars. So, in fact, it’s not possible to write a program that only does things on a scalar. [...] You can’t have an ‘integer’ in APL, you can only have an ‘array of integers’. There’s no ‘loop’s, there’s no [...]
APL, like Python2, is typically executed via an interpreter, which means that, like Python, execution of basic operations like calling functions can be quite slow. However, unlike Python, its pervasive reliance upon arrays meant that almost all of its operations could be safely parallelized, and would only get more and more efficient as more and more parallel hardware was developed.
I said ‘unlike Python’ there, but in fact, my father first related this concept to me regarding a part of the Python ecosystem which follows APL’s design idiom: NumPy. NumPy takes a similar approach: it cannot itself do anything to speed up Python’s fundamental interpreted execution speed3, but it can move the intensive numerical operations that it implements into operations on arrays, rather than operations on individual objects, whether numbers or not.
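To make that contrast concrete, here is a small sketch (my own illustration with made-up numbers, assuming NumPy is installed) of the same arithmetic done one scalar at a time versus on a whole array at once:

```python
import numpy as np

values = np.arange(1_000)

# Potato version: the interpreter counts through each element, one at a time.
potato_total = 0
for value in values:
    potato_total += int(value) * 2 + 1

# Array version: one vectorized expression; the iteration happens inside
# NumPy's compiled inner loop, not in the Python interpreter.
array_total = int((values * 2 + 1).sum())

assert potato_total == array_total  # both are 1_000_000
```

On arrays of any interesting size, the second version is typically orders of magnitude faster, because the per-element work no longer involves interpreter dispatch at all.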
This idiom is also more or less how GPU programming works. GPUs cannot operate on individual values. You submit a program5 to the GPU, as well as a large array of data6, and the GPU executes the program on that data in parallel across hundreds of tiny cores. Submitting individual values for the GPU to work on would actually be much slower than just doing the work on the CPU directly, due to the bus latency involved to transfer the data back and forth.
Back from the Backstory
This is all interesting for a class of numerical software — and indeed it works very well there — but it may seem a bit abstract to web backend developers just trying to glue together some internal microservice APIs, or indeed most app developers who aren’t working in those specialized fields. It’s not like Stripe is going to let you run their payment service on your GPU.
However, the lesson generalizes quite well: anywhere you see an API defined in terms of one-potato, two-potato iteration, ask yourself: “how can this be turned into an array?” Let’s go back to our example.
The simplest change that we can make, as a consumer of these potato-shaped APIs, is to submit our requests in parallel. So if we have to do the optimization in the orchestration layer, we might get something more like this:

```python
from asyncio import Semaphore, gather


async def bill_everyone_concurrently(
    subscriptions: SubscriptionService,
    billing: BillingService,
    batch_size: int = 100,
) -> None:
    # Limit how many requests we have in flight at once, so that we don't
    # completely overwhelm either backend service.
    semaphore = Semaphore(batch_size)

    async def bill_one(subscription_id: str) -> None:
        async with semaphore:
            each = await subscriptions.get_subscription(subscription_id)
            await billing.bill_customer(each.customer_id, each.amount_cents)

    # One task per subscription, all racing each other on the event loop.
    await gather(
        *(bill_one(each) for each in await subscriptions.all_subscription_ids())
    )
```
This is an improvement, but it’s a bit of a brute-force solution; a multipotato, if you will. We’ve moved the work to the billing service faster, but it still has to do just as much work. Maybe even more work, because now it’s potentially got a lot more lock-contention on its end. And we’re still waiting for the Subscription objects to dribble out of the SubscriptionService, potentially one request/response at a time.
In other words, we have used network concurrency as a hack to simulate a performant design. But the back end that we have been given here is not actually optimizable; we do not have a performant foundation. As you can see, we have even had to change our local architecture a little bit here, to add a Semaphore and a batch_size parameter which we had not previously needed.
A better-designed interface in the first place might look something like this:

```python
from dataclasses import dataclass
from typing import AsyncIterator


@dataclass
class BillingRequest:
    customer_id: str
    amount_cents: int


class SubscriptionService:
    def all_subscriptions(
        self, batch_size: int = 100
    ) -> AsyncIterator[list[Subscription]]:
        "Deliver every subscription, one batch at a time."
        ...


class BillingService:
    async def bill_customers(self, requests: list[BillingRequest]) -> None:
        "Bill a whole batch of customers in a single network round trip."
        ...
```
Superficially, the implementation here looks slightly more awkward than our naive first attempt:

```python
async def bill_everyone_batched(
    subscriptions: SubscriptionService, billing: BillingService
) -> None:
    async for batch in subscriptions.all_subscriptions():
        await billing.bill_customers(
            [BillingRequest(each.customer_id, each.amount_cents) for each in batch]
        )
```
However, while the implementation with batching in the backend is approximately as performant as our parallel orchestration implementation, backend batching has a number of advantages over parallel orchestration.
First, backend batching has less internal complexity; no need to have a Semaphore in the orchestration layer, or to create tasks on an event loop. There’s less surface area here for bugs.
Second, and more importantly: backend batching permits for future optimizations within the backend services, which are much closer to the relevant data and can achieve more substantial gains than we can as a client without knowledge of their implementation.
There are many ways this might manifest, but consider that each of these services has its own database, and has to submit queries and execute transactions against it.
In the subscription service, it’s faster to run a single SELECT that returns a bunch of results than to select a single result at a time. On the billing service’s end, it’s much faster to issue a single UPDATE and then COMMIT for N records at once than to concurrently issue a ton of potentially related modifications in separate transactions.
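As a toy sketch of the billing side of that difference (using an in-memory SQLite database as a stand-in for the billing service’s storage; the table and column names here are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE subscription (id INTEGER PRIMARY KEY, billed INTEGER NOT NULL)"
)
# Seed 100 hypothetical subscriptions, none billed yet.
conn.executemany("INSERT INTO subscription VALUES (?, 0)", [(i,) for i in range(100)])
conn.commit()

due = list(range(0, 100, 2))  # the batch handed to us by the orchestrator

# One UPDATE covering the entire batch, then a single COMMIT, instead of
# one transaction per record.
placeholders = ", ".join("?" for _ in due)
conn.execute(
    f"UPDATE subscription SET billed = 1 WHERE id IN ({placeholders})", due
)
conn.commit()

billed = conn.execute(
    "SELECT COUNT(*) FROM subscription WHERE billed = 1"
).fetchone()[0]
assert billed == len(due)  # 50 records updated in one statement
```

The potato equivalent would run that UPDATE and COMMIT fifty separate times, paying transaction overhead on every single record.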
Potato No Mo
The initial implementation within each of these backends can be as naive and slow as necessary to achieve an MVP. You can do a SELECT … LIMIT 1 internally, if that’s easier, and performance is not important at first. There can be a mountain of potatoes hidden behind the veil of that batched list. In this way, you can avoid the potential trap of premature optimization. Maybe this is a terrible factoring of services for your application in the first place; best to have that prototype in place and functioning quickly so that you can throw it out faster!
However, by initially designing an interface based on lists of things rather than individual things, it’s much easier to hide irrelevant implementation details from the client, and to achieve meaningful improvements when optimizing.
This is the first post supported by my Patreon, with a topic suggested by a patron.
It’s a really good essay, you should read it. ↩
Yes, I know it’s actually bytecode compiled and then run on a custom interpreting VM, but for the purposes of comparing these performance characteristics “interpreted” is a more accurate approximation. Don’t @ me. ↩
No, not a typo, that’s a 4-digit improvement. ↩
Typically called a “shader” due to its origins in graphically shading polygons. ↩
The data may represent vertices in a 3-D mesh, pixels in texture data, or, in the case of general-purpose GPU programming, “just a bunch of floating-point numbers”. ↩