Toward a “Kernel Python”

Prompted by Amber Brown’s presentation at the Python Language Summit last month, Christian Heimes has followed up on his own earlier work on slimming down the Python standard library, and created a proper Python Enhancement Proposal PEP 594 for removing obviously obsolete and unmaintained detritus from the standard library.

PEP 594 is great news for Python, and in particular for the maintainers of its standard library, who can now address a reduced surface area. A brief trip through the PEP’s rogues gallery of modules to deprecate or remove¹ is illuminating. The python standard library contains plenty of useful modules, but it also hides a veritable necropolis of code, a towering monument to obsolescence, threatening to topple over on its maintainers at any point.

However, I believe the PEP may be approaching the problem from the wrong direction. Currently, the standard library is maintained in tandem with, and by the maintainers of, the CPython python runtime. Large portions of it are simply included in the hope that it might be useful to somebody. In the aforementioned PEP, you can see this logic at work in defense of the colorsys module: why not remove it? “The module is useful to convert CSS colors between coordinate systems. [It] does not impose maintenance overhead on core development.”

There was a time when Internet access was scarce, and maybe it was helpful to pre-load Python with lots of stuff so it could be pre-packaged with the Python binaries on the CD-ROM when you first started learning.

Today, however, the modules you need to convert colors between coordinate systems are only a pip install away. The bigger core interpreter is just more to download before you can get started.

Why Didn’t You Review My PR?

So let’s examine that claim: does a tiny module like colorsys “impose maintenance overhead on core development”?

The core maintainers have enough going on just trying to maintain the huge and ancient C codebase that is CPython itself. As Mariatta put it in her North Bay Python keynote, the most common question that core developers get is “Why haven’t you looked at my PR?” And the answer? It’s easier to not look at PRs when you don’t care about them. This from a talk about what it means to be a core developer!

One might ask, whether Twisted has the same problem. Twisted is a big collection of loosely-connected modules too; a sort of standard library for networking. Are clients and servers for SSH, IMAP, HTTP, TLS, et. al. all a bit much to try to cram into one package?

I’m compelled to reply: yes. Twisted is monolithic because it dates back to a similar historical period as CPython, where installing stuff was really complicated. So I am both sympathetic and empathetic towards CPython’s plight.

At some point, each sub-project within Twisted should ideally become a separate project with its own repository, CI, website, and of course its own more focused maintainers. We’ve been slowly splitting out projects already, where we can find a natural boundary. Some things that started in Twisted like constantly and incremental have been split out; deferred and filepath are in the process of getting that treatment as well. Other projects absorbed into the org continue to live separately, like klein and treq. As we figure out how to reduce the overhead of setting up and maintaining the CI and release infrastructure for each of them, we’ll do more of this.

But is our monolithic nature the most pressing problem, or even a serious problem, for the project? Let’s quantify it.

As of this writing, Twisted has 5 outstanding un-reviewed pull requests in our review queue. The median time a ticket spends in review is roughly four and a half days.² The oldest ticket in our queue dates from April 22, which means it’s been less than 2 months since our oldest un-reviewed PR was submitted.

It’s always a struggle to find enough maintainers and enough time to respond to pull requests. Subjectively, it does sometimes feel like “Why won’t you review my pull request?” is a question we do still get all too often. We aren’t always doing this well, but all in all, we’re managing; the queue hovers between 0 at its lowest and 25 or so during a bad month.

By comparison to those numbers, how is core CPython doing?

Looking at CPython’s keyword-based review queue queue, we can see that there are 429 tickets currently awaiting review. The oldest PR awaiting review hasn’t been touched since February 2, 2018, which is almost 500 days old.

How many are interpreter issues and how many are stdlib issues? Clearly review latency is a problem, but would removing the stdlib even help?

For a quick and highly unscientific estimate, I scanned the first (oldest) page of PRs in the query above. By my subjective assessment, on this page of 25 PRs, 14 were about the standard library, 10 were about the core language or interpreter code; one was a minor documentation issue that didn’t really apply to either. If I can hazard a very rough estimate based on this proportion, somewhere around half of the unreviewed PRs might be in standard library code.

So the first reason the CPython core team needs to stop maintaining the standard library because they literally don’t have the capacity to maintain the standard library. Or to put it differently: they aren’t maintaining it, and what remains is to admit that and start splitting it out.

It’s true that none of the open PRs on CPython are in colorsys³. It does not, in fact, impose maintenance overhead on core development. Core development imposes maintenance overhead on it. If I wanted to update the colorsys module to be more modern - perhaps to have a Color object rather than a collection of free functions, perhaps to support integer color models - I’d likely have to wait 500 days, or more, for a review.

As a result, code in the standard library is harder to change, which means its users are less motivated to contribute to it. CPython’s unusually infrequent releases also slow down the development of library code and decrease the usefulness of feedback from users. It’s no accident that almost all of the modules in the standard library have actively maintained alternatives outside of it: it’s not a failure on the part of the stdlib’s maintainers. The whole process is set up to produce stagnation in all but the most frequently used parts of the stdlib, and that’s exactly what it does.

New Environments, New Requirements

Perhaps even more importantly is that bundling together CPython with the definition of the standard library privileges CPython itself, and the use-cases that it supports, above every other implementation of the language.

Podcast after podcast after podcast after keynote tells us that in order to keep succeeding and expanding, Python needs to grow into new areas: particularly web frontends, but also mobile clients, embedded systems, and console games.

These environments require one or both of:

a completely different runtime, such as Brython, or MicroPython
a modified, stripped down version of the standard library, which elides most of it.

In all of these cases, determining which modules have been removed from the standard library is a sticking point. They have to be discovered by a process of trial and error; notably, a process completely different from the standard process for determining dependencies within a Python application. There’s no install_requires declaration you can put in your setup.py that indicates that your library uses a stdlib module that your target Python runtime might leave out due to space constraints.

You can even have this problem even if all you ever use is the standard python on your Linux installation. Even server- and desktop-class Linux distributions have the same need for a more minimal core Python package, and so they already chop up the standard library somewhat arbitrarily. This can break the expectations of many python codebases, and result in bugs where even pip install won’t work.

Take It All Out

How about the suggestion that we should do only a little a day? Although it sounds convincing, don’t be fooled. The reason you never seem to finish is precisely because you tidy a little at a time. [...] The ultimate secret of success is this: If you tidy up in one shot, rather than little by little, you can dramatically change your mind-set.

Kondō, Marie.
“The Life-Changing Magic of Tidying Up”
(p. 15-16)

While incremental slimming of the standard library is a step in the right direction, incremental change can only get us so far. As Marie Kondō says, when you really want to tidy up, the first step is to take everything out so that you can really see everything, and put back only what you need.

It’s time to thank those modules which do not spark joy and send them on their way.

We need a “kernel” version of Python that contains only the most absolutely minimal library, so that all implementations can agree on a core baseline that gives you a “python”, and applications, even those that want to run on web browsers or microcontrollers, can simply state their additional requirements in terms of requirements.txt.

Now, there are some business environments where adding things to your requirements.txt is a fraught, bureaucratic process, and in those places, a large standard library might seem appealing. But “standard library” is a purely arbitrary boundary that the procurement processes in such places have drawn, and an equally arbitrary line may be easily drawn around a binary distribution.

So it may indeed be useful for some CPython binary distributions — perhaps even the official ones — to still ship with a broader selection of modules from PyPI. Even for the average user, in order to use it for development, at the very least, you’d need enough stdlib stuff that pip can bootstrap itself, to install the other modules you need!

It’s already the case, today, that pip is distributed with Python, but isn’t maintained in the CPython repository. What the default Python binary installer ships with is already a separate question from what is developed in the CPython repo, or what ships in the individual source tarball for the interpreter.

In order to use Linux, you need bootable media with a huge array of additional programs. That doesn’t mean the Linux kernel itself is in one giant repository, where the hundreds of applications you need for a functioning Linux server are all maintained by one team. The Linux kernel project is immensely valuable, but functioning operating systems which use it are built from the combination of the Linux kernel and a wide variety of separately maintained libraries and programs.

Conclusion

The “batteries included” philosophy was a great fit for the time when it was created: a booster rocket to sneak Python into the imagination of the programming public. As the open source and Python packaging ecosystems have matured, however, this strategy has not aged well, and like any booster, we must let it fall back to earth, lest it drag us back down with it.

New Python runtimes, new deployment targets, and new developer audiences all present tremendous opportunities for the Python community to soar ever higher.

But to do it, we need a newer, leaner, unburdened “kernel” Python. We need to dump the whole standard library out on the floor, adding back only the smallest bits that we need, so that we can tell what is truly necessary and what’s just nice to have.

I hope I’ve convinced at least a few of you that we need a kernel Python.

Now: who wants to write the PEP?

🚀

Acknowledgments

Thanks to Jean-Paul Calderone, Donald Stufft, Alex Gaynor, Amber Brown, Ian Cordasco, Jonathan Lange, Augie Fackler, Hynek Schlawack, Pete Fein, Mark Williams, Tom Most, Jeremy Thurgood, and Aaron Gallagher for feedback and corrections on earlier drafts of this post. Any errors of course remain my own.

sunau, xdrlib, and chunk are my personal favorites. ↩
Yeah, yeah, you got me, the mean is 102 days. ↩
Well, as it turns out, one is on colorsys, but it’s a documentation fix that Alex Gaynor filed after reviewing a draft of this post so I don’t think it really counts. ↩