Deciphering Glyph

A collection of articles, ideas, and rambling from a guy who wrote some software that one time.

Thursday, December 27, 2012

The Twisted Way

One of the things that confuses me most about Twisted is the fact that so many people seem to be confused by things about Twisted.

Much has been written, some of it by me, some of it by other brilliant members of the community, attempting to explain Twisted in lots of detail so that you can use it and understand it and control it to do your bidding.  But today, I'd like to try something different, and instead of trying to help you figure out how to use Twisted, I will try to help you understand what Twisted is: to aid you in meditating upon its essence, and in understanding how it is a metaphor for software, and, if you are truly enlightened, for all life.

Let us contemplate the Twisted Way.


Image Credit: Ian Sane

In the beginning was the Tao.
All things issue from it; all things return to it.

All information systems are metaphors for the world.  All programs are, at least in small part, systems.  Therefore, every program contains within it a world of thought.  Those thoughts must be developed before they can be shared; therefore, one must have an interesting program before one thinks to do any interesting I/O.

It is less fashionable these days to speak of "object-oriented modeling" than it once was, but mostly because object-oriented design is now so pervasive that no-one needs convincing any more.  Nevertheless, that is what almost all of us do.  When an archetypical programmer in this new millennium sets out to create a program, they will typically begin by creating a Class, and then endowing that Class with Behavior; then, by creating an Instance of that Class, and bestowing interesting Data upon that Instance.

Such an Instance receives (as input) method calls from some other object, and produces (as output) method calls on some other object.  It is a system unto itself, simulating some aspect of human endeavor, computing useful results and dispensing them, all in an abstract vacuum.

But, the programmer who produced this artifact desires it to interact with the world; to produce an effect, and therefore to accept Input and dispense Output.

It is at this point that the programmer encounters Twisted.

When you look for it, there is nothing to see.
When you listen for it, there is nothing to hear.
When you use it, it is inexhaustible.


Except that, in fact, nobody ever encounters Twisted this way.  If this is where – and how – you encounter Twisted, then you will likely have great success with it.  But everyone tends to encounter Twisted, like one encounters almost every other piece of infrastructure, in medias res.  Method calls are flying around all over the place in some huge inscrutable system and you just have to bang through the tutorial to figure it all out right now, and it looks super weird.

Over the years, so many questions I've answered about Twisted seem to reduce to: "how do I even get this thing to do anything"?

This is Twisted's great mystery: it does nothing.  By itself, it is the world's largest NOP.  Its job, purely and simply, is to connect your object to the world.  You tell Twisted: listen for connections on this port; when one is made, do this.  Make this request, and when it has a response, do that.  Listen for email over SMTP; when one arrives, do the other.

Without your direction, reactor.run will just ... wait.
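To make the "it does nothing" point concrete, here is a minimal sketch of a toy event dispatcher.  It is emphatically not Twisted's reactor: ToyReactor, when, and post are names invented for this illustration, and unlike reactor.run, this run returns as soon as its queue is empty instead of waiting for the world.  The shape of the idea is the same, though: until you tell it what to do with an event, the event goes nowhere.

```python
class ToyReactor:
    """A toy stand-in for an event loop. Hypothetical; not Twisted's reactor."""

    def __init__(self):
        self._handlers = {}   # event name -> list of callables
        self._pending = []    # queued (event name, payload) pairs

    def when(self, eventName, handler):
        """Register handler to be called for each eventName event."""
        self._handlers.setdefault(eventName, []).append(handler)

    def post(self, eventName, payload):
        """Queue an event, as the outside world might."""
        self._pending.append((eventName, payload))

    def run(self):
        """Deliver queued events; return how many handler calls were made."""
        calls = 0
        while self._pending:
            eventName, payload = self._pending.pop(0)
            for handler in self._handlers.get(eventName, []):
                handler(payload)
                calls += 1
        return calls


# With no direction, an event arrives and nothing happens.
idle = ToyReactor()
idle.post("connection", "alice")
idleCalls = idle.run()

# With direction, the same event reaches your object.
received = []
directed = ToyReactor()
directed.when("connection", received.append)
directed.post("connection", "alice")
directedCalls = directed.run()
```

All of the interesting behavior lives in whatever you pass to when; the loop itself contributes nothing but delivery.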

The source of most confusion with Twisted, I believe, is that few objects are designed in this idiom.  When we seek to create a program, we feel we must start interacting with it immediately, before it even knows what it is supposed to do.  The seductions of blocking I/O are many and varied.  Any function which appears to merely compute a result can simply be changed to cheat and get its answer by asking some other system instead, with its callers none the wiser.  Even for those of us who know better, these little cheats accumulate and make the program brittle and slow, and force it to be spun out into a thread, or the cold, sparse desert of its own separate process, so it may tediously plod along, waiting for the response to its each and every query.

Thus, desire (for immediate I/O) leads to suffering (of the maintenance programmer).

Return is the movement of the Tao.
Yielding is the way of the Tao.

It doesn't need to be that way, though.  When you create an object, it is best to create it as independently as possible; to test it in isolation; to discretely separate its every interaction with the outside world so that they may be carefully controlled, monitored, intercepted and inspected, one iteration at a time.

All your object needs to do is to define its units of work as atomic, individual functions that it wishes to perform; then, return to its caller and allow it to proceed.

The Master does his job and then stops.
He understands that the universe is forever out of control.

When you design your objects by contemplating their purpose and making them have a consistent, nicely separated internal model of what they're supposed to represent, Twisted seems less like a straitjacket, contorting your program into some awkward shape.  Instead, it becomes a comfortable jacket that your object might slip on to protect itself from the vicissitudes of whatever events may assault it from the network, whether they be the soft staccato of DNS, the confused warbling of SIP or the terrifying roar of IMAP.


Better yet, your object can be a model of some problem domain, and will therefore have a dedicated partner; a different object, a wrapper, whose entire purpose is to translate from a lexicon of network-based events, timers, and Deferred callbacks, into a language that is more directly applicable to your problem domain.  After all, each request or response from the network means something to your application, otherwise it would not have been made; the process of explicitly enumerating all those meanings and recording and documenting them in a module dedicated to that purpose is a very useful exercise.
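A minimal sketch of such a partnership might look like the following.  Everything in it is an assumption made for illustration: the Thermostat domain object, the ThermostatLineWrapper, and the tiny "TEMP <degrees>" line protocol are all invented here.  The only thing borrowed from Twisted is the method name dataReceived, which is what a Twisted protocol is handed when bytes arrive; this sketch does not import Twisted at all, and takes a plain callable where a real protocol would use its transport.

```python
class Thermostat:
    """Domain object: knows about temperatures, nothing about networks."""

    def __init__(self, target):
        self.target = target
        self.heating = False

    def temperatureReported(self, degrees):
        """Decide whether to heat, and return that decision."""
        self.heating = degrees < self.target
        return self.heating


class ThermostatLineWrapper:
    """Hypothetical wrapper: translates raw bytes into domain-level calls."""

    def __init__(self, thermostat, write):
        self._thermostat = thermostat
        self._write = write    # any callable that accepts bytes
        self._buffer = b""

    def dataReceived(self, data):
        """Accept bytes in arbitrary chunks; act on each complete line."""
        self._buffer += data
        while b"\n" in self._buffer:
            line, self._buffer = self._buffer.split(b"\n", 1)
            self._lineReceived(line.strip())

    def _lineReceived(self, line):
        command, _, argument = line.partition(b" ")
        if command == b"TEMP":
            degrees = float(argument.decode("ascii"))
            heating = self._thermostat.temperatureReported(degrees)
            self._write(b"HEAT ON\n" if heating else b"HEAT OFF\n")
        else:
            self._write(b"ERROR unknown command\n")


# No network required: the "events" here come from ordinary method calls,
# deliberately split mid-line the way a real connection might split them.
sent = []
thermostat = Thermostat(target=20.0)
wrapper = ThermostatLineWrapper(thermostat, sent.append)
wrapper.dataReceived(b"TEMP 18")
wrapper.dataReceived(b".5\nTEMP 22\n")
```

Note that Thermostat can be exercised entirely on its own, and the wrapper can be exercised with nothing more than a list's append method standing in for the outside world.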

When your object has such unity of purpose and clarity of function, then Twisted can help manage its stream of events even if the events are not actually coming from a network; abstractions like deferreds, producers, consumers, cooperators and inline callbacks can be used to manipulate timers, keystrokes, and button clicks just as easily as network traffic.

True words aren't eloquent; eloquent words aren't true.
Wise men don't need to prove their point; men who need to prove their point aren't wise.

So, if you are setting out to learn to use Twisted, approach it in this manner: it is not something that will, itself, give your object's inner thoughts structure, purpose and meaning.  It is merely a wrapper; an interstitial layer between your logic and some other system.  The methods it calls upon you might be coming from anywhere.  And indeed, they should be coming from at least one other place: your unit tests.

(With apologies to Geoffrey James and Laozi.)

Monday, October 22, 2012

A Tired Hobgoblin

Alternate (Boring) Title: Why the Twisted coding standard is better than PEP8 (although you still shouldn't care)

People often ask me why Twisted's coding standard – camel case instead of underscores for method names, epytext instead of ReST for docstrings, underscores for prefixes – is "weird" and doesn't, for example, follow PEP 8.

First off, I should say that the Twisted standard actually follows quite a bit of PEP 8.  PEP 8 is a long document with many rules, and the Twisted standard is compatible in large part.  For example, pretty much all of the recommendations in the section on pointless whitespace.

Also, the primary reason that Twisted differs at all from the standard practice in the Python community is that the "standard practice" was almost all developed after Twisted had put its practices in place.  PEP 8 was created on July 5, 2001; at that point, Twisted had already existed for some time, and had officially checked in its first coding standard just a smidge over two months earlier, on May 2, 2001.

That's where my usual explanation ends.  If you're making a new Python project today, unless it is intended specifically as an extension for Twisted, you should ignore the relative merits of these coding standards and go with PEP 8, because the benefits of consistency generally outweigh any particular benefits of one coding standard or another.  Within Twisted, as PEP 8 itself says, "consistency within a project is even more important", so we're not going to change everything around just for broader consistency, but if we were starting again we might.

But.

There seems to be a sticking point around the camelCase method names.

After ten years of fielding complaints about how weird and gross and ugly it is – rather than just how inconsistent it is – to put method names in camel case, I feel that it is time to speak out in defense of the elegance of this particular feature of our coding standard.  I believe that this reaction is based on Python programmers' ancestral memory of Java programs, and it is as irrational as people disliking Python's blocks-by-indentation because COBOL made a much more horrible use of significant whitespace.

For starters, camelCase harkens back to a long and venerable tradition.  Did you camelCase haters-because-of-Java ever ask yourselves why Java uses that convention?  It's because it's copied from the very first object-oriented language.  If you like consistency, then Twisted is consistent with 34 years of object-oriented programming history.

Next, camelCase is easier to type.  For each word-separator, you have only to press "shift" and the next letter, rather than shift, minus, release shift, next letter.  Especially given the inconvenient placement of minus on US keyboards, this has probably saved me enough time that it's added up to at least six minutes in the last ten years.  (Or, a little under one-tenth the time it took to write this article.)

Method names in mixedCase are also more consistent with CapitalizedWord class names.  If you have to scan 'xX' as a word boundary in one case, why learn two ways to do it?

Also, we can visually distinguish acronyms in more contexts in method names.  Consider the following method names:
  • frog_blast_the_vent_core
  • frogBLASTTheVentCore
I believe that the identification of the acronym improves readability. frog_blast_the_vent_core is just nonsense, but frogBLASTTheVentCore makes it clear that you are doing sequence alignment on frog DNA to try to identify variations in core mammalian respiration functions.

Finally, and this is the one that I think is actually bordering on being important enough to think about, Twisted's coding standard sports one additional feature that actually makes it more expressive than underscore_separated method names.  You see, just because the convention is to separate words in method names with capitalization, that doesn't mean we broke the underscore key on our keyboards.  The underscore is used for something else: dispatch prefixes.

Ironically, since the first letter of a method must be lower case according to our coding standard, this conflicts a little bit with the previous point I made, but it's still a very useful feature.

The portion of a method name before an underscore indicates what type of method it is.  So, for example:
  • irc_JOIN - the "irc_" prefix on an IRC client or server object indicates that it handles the "JOIN" message in the IRC protocol
  • render_GET - the "render_" prefix on an HTTP resource indicates that this method is processing the GET HTTP method.
  • remote_loginAnonymous - the "remote_" prefix on a Perspective Broker Referenceable object indicates that this is the implementation of the PB method 'loginAnonymous'
  • test_addDSAIdentityNoComment - the "test_" prefix on a trial TestCase indicates that this is a test method that should be run automatically. (Although for historical reasons and PyUnit compatibility the code only actually looks at the "test" part.)
The final method name there is a good indication of the additional expressiveness of this naming convention.  The underscores-only version – test_add_dsa_identity_no_comment – depends on context.  Is this an application function that is testing whether we can add a ... dissah? ... identity with no comment?  Or a unit test?  Whereas the Twisted version is unambiguous: it's a test case for adding a D.S.A. identity with no comment.  It would be very odd, if not a violation of the coding standard, to name a method that way outside of a test suite.
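For anyone who hasn't seen a dispatch prefix at work, here is a rough sketch of the mechanism.  It is not Twisted's actual IRC implementation: ToyIRCHandler, handleCommand, and unknownCommand are names made up for this example, and real handlers take more arguments than this.  The part that matters is the getattr call, which builds a method name out of the prefix and the wire-level command.

```python
class ToyIRCHandler:
    """Hypothetical handler illustrating prefix dispatch; not Twisted's code."""

    def __init__(self):
        self.log = []

    def handleCommand(self, command, params):
        # "JOIN" on the wire becomes a lookup of the method "irc_JOIN".
        method = getattr(self, "irc_" + command, None)
        if method is None:
            return self.unknownCommand(command, params)
        return method(params)

    def irc_JOIN(self, params):
        self.log.append(("joined", params[0]))

    def irc_PING(self, params):
        self.log.append(("pong", params[0]))

    def unknownCommand(self, command, params):
        self.log.append(("unknown", command))


handler = ToyIRCHandler()
handler.handleCommand("JOIN", ["#twisted"])
handler.handleCommand("QUIT", [])
```

Because the part after the underscore is copied straight from the protocol, its capitalization is the protocol's, not the coding standard's, and a reader can tell at a glance which methods are reachable from the network.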

Hopefully this will be the last I'll say on the subject.  Again, if you're starting a new Python project, you should really just go ahead and use PEP 8, this battle was lost a very long time ago and I didn't even really mind losing it back then.  Just please, stop telling me how ugly and bad this style is.  It works very nicely for me.

Sunday, October 07, 2012

The Lexicology of Personal Development

These days, everybody talks about geeks.  Geek chic, the "age of the geek"; even the New York Times op-ed page has been talking about the rise of "geeks" for years.  Bowing to popular usage, even I use the word as it's currently being bandied about.  But I think that the real success story is that of nerds.

A pernicious habit I've noticed in the last decade of the growth of geek culture is that it has developed a sort of cargo-cult of meritocracy.  Within the self-identified "geek" community, there's a social hierarchy based on all kinds of ridiculous pop-culture fetishism.  Who knows the most Monty Python non sequiturs?  Who knows the most obscure Deep Space Nine trivia?  This is hardly a new thing – William Shatner famously complained about it on Saturday Night Live in 1986 – but the Internet has been accelerating the phenomenon tremendously.  People who had a difficult time in their teens find each other as adults through some fan-club interest group, and then they make fast friends who had similar social problems.  Soon, since that's the shared interest that they know all their friends from, they spend all their time in the totally fruitless pursuit of more junk related to some frivolous obsession.  That can be okay, almost healthy even, if the focus of this accumulation is a productive hobby. However, if it's just a pop-culture franchise (Harry Potter, Star Trek, World of Darkness) what was originally a liberating new social landscape can rapidly turn into a suffocating, stale dead-end for personal development.

So I always feel a twinge when I identify myself as a "geek".  I usually prefer to say that I am - or at least aspire to be - a nerd.

A nerd is someone who is socially awkward because they are more thoughtful, introspective, intelligent or knowledgeable than their peers.  They notice things that others don't, and it makes interaction difficult.  This is especially obvious in younger nerds, where they're a little above their age group's intelligence but not quite intelligent enough to know when to keep their mouths shut to avoid ostracism.  But, even if they have learned to keep a lid on their less-popular observations, it's tough to constantly censor yourself and it makes interaction with your peers less enjoyable.

A geek is someone who is socially awkward because they are obsessed with topics that the mundanes among us just don't care about that much. They collect things, whether it's knowledge, games, books, toys, or technology.  Faced with a popular science fiction movie, a nerd might want to do the math to see whether the special effects are physically plausible, but a geek will just watch it a dozen times to memorize all the lines.

A dork is just socially awkward because they just aren't all that pleasant to be around.  Nerds and geeks have trouble with interacting with others because they're lost in their own little worlds of intellectual curiosity or obsession: dorks are awkward because, let's face it, maybe they're a little stupid, a little mean, and just not that interesting.  A dork is unsympathetic.

By way of a little research for this post, I discovered that I'm apparently not the only one who has this impression of the definitions, and even Paul Graham seems to agree with me on word choice.  Still: from here on out, these are the correct definitions of the words, thank you very much.

Maybe you've heard these definitions before, and this is all old news. Also, these are words for the sort of tedious taxonomy of people that fictional teenagers in high-school movies do.  It's obviously not karmically healthy to start labeling people "nerd", "dork",  and "geek" and then writing them off as such.  So, you might ask, why do I bring it up?

Because you, like me, are almost certainly a nerd, a geek, and a dork.  And, as you might have inferred from my definitions above, nerds are better than geeks, and dorks are worse than both.

First, consider your inner nerd.  It's good to be intellectually curious, to stretch your cognitive abilities in new and interesting ways, to learn things about how systems work.  Physical systems, social systems, technological systems: it's always good to know more.  It's even good to be curious to the point of awkwardness, especially if you're a kid who is concerned about awkwardness; don't worry about it, it'll make you more interesting later.  It's good to foster any habits which are a little nerdy.

Second, your inner geek.  It's okay to enjoy things, even to obsess about them a little bit, but I think that our culture is really starting to overdo this.  Geeks are presented in popular media as equally, almost infinitely, obsessed with Star Wars, calculus, Star Trek, computer security, and terrible food (cheese whiz, sugary soda brands, etc).  No real people actually have time for all this stuff.  At some point, you have to choose whether you're going to memorize Maxwell's or Kosinski's equations.

One way that you can keep your inner geek in check is to always ask yourself the question: am I watching this movie / playing this game / reading this book because I actually enjoy it and I think it's worthwhile, or am I just trying to make myself conform to some image of myself as someone who knows absolutely everything about this one little cultural niche?

There are people who will treat being a fan of something that someone else created as morally equivalent (or, in a sense, even better than) creating something yourself, and those people are not doing you any favors.  Do not pay attention to them.

Of course, there's some overlap.  People who like playing with systems in real life enjoy the fluffier, more lightweight intellectual challenges of playing with the rules of fictional universes, especially the ones from speculative fiction.  When I was a kid, I went to a couple of Star Trek conventions and let me tell you, there were some legit nerds there; astrophysicists, rocket scientists, and experimental chemists, all excitedly talking about how they were inspired to pursue their careers by fiction of various kinds.

So go ahead, take a break, and geek out. Just don't tell yourself that it's anything other than for fun.

Finally, your inner dork.

As you're enthusiastically cultivating your nerdiness and carefully managing your geekiness, you will be accumulating a little bit of dorkiness as you go: at some point you have to make decisions about whether to do some minor social obligation in order to spend some time on learning a new thing (or re-watching your favorite movie).  You have to decide whether to restrain yourself so you can listen to your friend talk about a rough day at their job or to start spouting facts about the progress of the repairs on the Large Hadron Collider.

Sometimes, on balance, it's acceptable to be a little bit inconsiderate in the pursuit of something more important.  People worth being friends with will see that and understand.  Heck, practically every movie plot these days puts at least one awkward and abrasive nerd in a sympathetic and even heroic position.  But be careful: once you decide that social graces are your lowest priority, it's a hop, skip, and a jump from being a lovable but absent-minded genius to being a blathering blowhard who just will not shut up about some tedious Riemannian manifold crap that nobody cares about even though we just told them that somebody died.

The goal of the nerd or the geek, after all, is not to be awkward; it's easy to forget sometimes that that is an unintentional and unpleasant side effect of the good parts of those attributes.  Being a dork is just bad.  After all, if you're so smart, why aren't you nice?

Friday, July 13, 2012

Simple Made Variadic

Last night I made a snarky tweet about how Clojure is doomed.  Out of context, it didn't really make a lot of sense, and Ivan Krstić replied asking what the heck I was talking about.  I tried to fit the following into a tweet but it kinda broke the tweeterizer I was using right in half, and so I had to put it here.

I love a good snark as much as the next person - some might say more - but it really bothers me when people make snide comments denigrating others' free work without at least offering a cogent criticism to go with it, and I don't want to be that guy.  So, hopefully before the whole Clojure community finds said tweet and writes me off as an arrogant Python bigot, I would like to explain what I meant in a bit more detail.

Right off the bat I should say that this was a bit tongue-in-cheek.  I actually rather like Clojure and I think Rich Hickey has some very compelling ideas about programming in general.  I watch his talk "Simple Made Easy" once every month or two, contemplating its deeper meaning, and I still usually come away with an insight or two.

I should also make it clear that I was linking to the recur special form specifically, and not just the special forms documentation in general.  Obviously having reference docs isn't a bad thing.

Ivan, (or should I say, "@radian"?) you may be right; that documentation you linked to may indeed one day spell Python's doom.  If Python does eventually start to suck, it will be because it collapsed under the weight of its own weird edge cases like the slightly-special behavior of operator dispatch as compared to method dispatch, all the funkiness of descriptors, context managers, decorators, metaclasses, et cetera.

A portion of my point was serious, though.  The documentation for recur does highlight some problems with Clojure that the Python docs can play an interesting counterpoint to.

The presence of the recur form at all is an indication of the unhealthy level of obsession that all LISPs have with recursion.  We get it: functions are cool. You can call them. They can call themselves. Every other genre of language manages to use this to a reasonable degree of moderation without adding extra syntactic features to their core just so you can recurse forever without worrying about stack resources. Reading this particular snippet of documentation, I can almost hear Rich Hickey cackling as he wrote it, having just crowned himself God-Emperor of the smug LISP weenies, as he gleefully points out that Scheme has it wrong and the CL specifications had it wrong with respect to tail call elimination, and that it should be supported by the language but also be explicit and compiler-verified.
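For the Python programmers in the audience, the "stack resources" in question are easy to run out of.  The sketch below assumes CPython with its default recursion limit (about a thousand frames); countDownRecursive and countDownLoop are throwaway names for this example.  The recursive version makes its call in tail position, which is exactly the case that recur exists to handle, and Python simply gives up on it.

```python
def countDownRecursive(n):
    """Tail-recursive in form, but Python still uses one frame per call."""
    if n == 0:
        return "done"
    return countDownRecursive(n - 1)


def countDownLoop(n):
    """The same computation as a loop; uses a single stack frame."""
    while n != 0:
        n -= 1
    return "done"


try:
    recursiveResult = countDownRecursive(100_000)
except RecursionError:
    # Under the default recursion limit, this is the branch that runs.
    recursiveResult = "stack exhausted"

loopResult = countDownLoop(100_000)
```

Python's answer to this problem is "write the loop"; Clojure's is to give the loop a recursive spelling that the compiler can check.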

The sad thing is, my hypothetical caricature of Clojure's inventor is actually right! This is a uniquely clever solution to a particularly thorny conceptual problem with the recursive expression of algorithms. The Scheme folks and the Common Lisp folks did both kinda get it wrong.  But the fact that this has to be so front-and-center in the language is a problem. Most algorithms shouldn't be expressed recursively; it is actually quite tricky to communicate about recursive code, and most systems that really benefit from it have to be mutually, dynamically reentrant anyway and won't be helped by tail call elimination.  (My favorite example of this is still the unholy shenanigans that Deferreds have to get up to to make sure you don't have to care about callback chain length or Deferred return nesting depth.)

Also, if you want to be all automatically parallelizable and web scale and "cloud"-y, recursion and iteration are both the wrong way to do it; they're both just ways of tediously making your way down a list of things one element at a time.  What you want to do is to declaratively apply a computation to your data in such a way as to avoid saying anything about the order things have to happen in.  To put it more LISPily, (map) is a better conceptual foundation for the future than (loop) or (apply).  Of course you can do the naive implementation of (map) with (recur), but smarter implementations need application code to be written some other way.
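In Python terms, the difference might be sketched like this.  The wordCount function and the sample lines are invented for the example, and a thread pool is just one convenient stand-in for "a smarter implementation"; the point is only that the map-style call sites say nothing about ordering, so the machinery underneath can be swapped without touching the function being applied.

```python
from concurrent.futures import ThreadPoolExecutor


def wordCount(line):
    """A computation on one element, with no knowledge of its neighbors."""
    return len(line.split())


lines = [
    "in the beginning was the Tao",
    "all things issue from it",
    "all things return to it",
]

# The loop spells out an order: first this one, then the next.
loopCounts = []
for line in lines:
    loopCounts.append(wordCount(line))

# map only says "apply this to all of those".
mapCounts = list(map(wordCount, lines))

# So the same request can be handed to a pool of workers unchanged;
# Executor.map still returns results in input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    pooledCounts = list(pool.map(wordCount, lines))
```

The loop version would need to be rewritten to run anywhere but in sequence; the map versions differ only in who is asked to do the mapping.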

The language style choices of the manual in this case are also telling. The Python docs that Ivan linked to go into excruciating detail, rephrasing and explaining the same concept in a few different ways, linking to other required concepts in depth so the reader can easily familiarize themselves with any prerequisites, while still essentially explaining a nerdy part of the language that you can ignore while still using it productively. Every Python programmer ignores descriptors while they're learning to write classes and methods, despite the fact that they're using them all the time; this ability to be understood at different levels of complexity is a strength of every good language, and Python does particularly well in that regard. Of course one could also make the case that this is just because Python has so many dusty corners hidden behind double-underscores, and a better language would just have less obscure junk in it, not make understanding the obscure junk optional, but I digress.
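If you'd like to see the descriptor you've been ignoring, here is a small sketch; Greeter is a made-up class, but the behavior shown is standard Python.  A plain function stored on a class has a __get__ method, and ordinary attribute access on an instance calls it for you to produce a bound method.

```python
class Greeter:
    def greet(self):
        return "hello from " + type(self).__name__


g = Greeter()

# What is stored on the class is just a function...
plainFunction = Greeter.__dict__["greet"]

# ...but functions are descriptors, so this is what g.greet does for you.
boundByHand = plainFunction.__get__(g, Greeter)
```

Calling boundByHand() and calling g.greet() do the same thing, and you never had to know that in order to write a method.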

The description of recur, by contrast, is deeply flawed.  It is terse, to a fault. It introduces the concept of "recursion points" without linking to any kind of general reference.  It uses abbreviations all over the place ("exprs", "params", "args", "seq") without even using typesetting to illuminate whether they are using a general term or a specific variable name.

But, by far, the worst sin of this document is the use of the words "variadic" and "arity". There is really no excuse for the use of these words, ever. Take it from me, I am exactly the kind of pedantic jerk who will drop "arity" into a conversation about, for example, music theory, just to demonstrate that I can, and as that kind of jerk I can tell you with certainty: I have no excuse.

It should say: "a function that takes a variable number of arguments". Or possibly: "it must take exactly the same number of arguments as the recursion point".

This was a particularly disappointing example to me because Clojure strikes me as a particularly... for lack of a better word, "optimistic" lisp, one that looks forward rather than back, one that is more interested in helping people find comprehensible and composable ways to express programs than in revisiting obscure debates over reader macros or lisp-1/lisp-2 holy wars.  But the tone of the documentation for recur aims it straight at the smug lisp weenie crowd.

As I hope is obvious, if not initially, then at least by now, I don't think that Clojure will fail (or succeed) on the merits of one crummy piece of documentation. It's a much younger language than Python, so it may have a ways to go in its documentation practices. It also comes from an illustrious heritage that I can't expect to see none of in the way that it talks about itself, no matter how unfortunate certain details of that heritage are.  Heck, at the parallel point in Python's lifetime, it didn't even have descriptors yet, let alone the documentation for them!

Still, I don't think that this issue is entirely trivial, and I hope that the maintainers for the documentation for Clojure, at least the documentation for the parts of the language you have to see every day, take care to improve its accessibility to the less arcane among us.

Friday, April 13, 2012

We'll Always Have Cambridge

Half-way through 2012, I will be leaving the east coast.

There are a great many things I despair of leaving behind; family, friends, the most excellent Boston Python Meetup, participating in the sometimes incendiary, sometimes hilarious Cambridge, Massachusetts / Cambridge, England rap war.

However, I'm not writing today in order to wax lyrical about the area, or to extol the virtues of my new home, but hopefully, to prevent a missed opportunity.  I know there are at least a few really cool people in Massachusetts who read this blog, and who read my tweets, that I either haven't seen in quite a while or have never actually met in person.

So if grabbing a coffee with me is an interesting idea to you, please drop me a line within the next month. I would love to hear your story about how PHP ruined your summer, or how Twisted changed your life, or how you once pwned a vending machine with nothing but a malformed JPEG.

I'm sure I'll visit the area from time to time, so this isn't quite your last chance, but it just won't be the same, you know?

If I follow you, you can DM me on Twitter of course, but my email address isn't hard to figure out either.  If you glance up towards the top of your browser window right now, you're practically looking at it.

Sunday, February 12, 2012

This Isn't How PyPy Works, But it Might as Well Be

It seems like a lot of the Python programmers I speak with are deeply confused by PyPy and can't understand how it works.  The stereotypical interlocutor will often say things like: A Python VM in Python?  That's just crazy!  How can that be fast?  Isn't Python slower than C?  Aren't all compilers written in C?  How does it make an executable?

I am not going to describe to you how PyPy actually works.  Lucky for you, I'm not smart enough to do that.  But I would like to help you all understand how PyPy could work, and hopefully demystify the whole idea.

The people who are smart enough to explain how PyPy actually works will do it over at the PyPy blog.  At some level it's really quite straightforward, but this impression of straightforwardness is not conveyed well by posts with titles like "Optimizing Traces of the Flow Graph Language".  In addition to being a Python interpreter in Python, PyPy is a mind-blowingly advanced exploration of the cutting-est cutting-edge compiler and runtime technology, which can make it seem complex. In fact, the fact that it's in Python is what lets it be so cutting-edge.

Most people with a formal computer science background are already familiar with the fairly generic nature of compilers, as well as the concept of a self-hosting compiler.  If you do have that background, then that's all PyPy is: a self-hosting compiler.  The same way GCC is written in C, PyPy is written in Python.  When you strip away the advanced techniques, that's all that's there.

A lot of folks who are confused by PyPy's existence, though, I suspect don't have that background; many working programmers these days don't.  Or if they do, they've forgotten it, because the practical implications of the CSS box model are so complex that they squeeze simpler ideas, like Turing completeness and the halting problem, out of the average human brain.  So here's the easier explanation.

A compiler is a program that turns a string (source code: your program text written in Python, C, Ruby, Java, or whatever) into some kind of executable code (bytecode or runtime interpreter operations or a platform-native executable).
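To show how un-magical that definition is, here is a toy compiler in a few dozen lines of Python.  It has nothing to do with PyPy's actual design: the three-opcode bytecode format, compileExpression, and run are all invented for this sketch, and it only understands non-negative integer literals, + and *.  It borrows Python's own ast module to do the parsing, turns a string into a packed sequence of bytes, and then a second small function executes those bytes.

```python
import ast
import struct

# A made-up bytecode: PUSH is followed by a 4-byte little-endian integer.
PUSH, ADD, MUL = 1, 2, 3


def compileExpression(source):
    """Turn a string like '2 + 3 * 4' into bytes for a toy stack machine."""
    tree = ast.parse(source, mode="eval")
    out = bytearray()

    def emit(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            out.extend(struct.pack("<Bi", PUSH, node.value))
        elif isinstance(node, ast.BinOp) and isinstance(node.op, (ast.Add, ast.Mult)):
            emit(node.left)
            emit(node.right)
            out.append(ADD if isinstance(node.op, ast.Add) else MUL)
        else:
            raise SyntaxError("unsupported expression")

    emit(tree.body)
    return bytes(out)


def run(code):
    """Execute the toy bytecode and return the value left on the stack."""
    stack = []
    position = 0
    while position < len(code):
        opcode = code[position]
        position += 1
        if opcode == PUSH:
            (value,) = struct.unpack_from("<i", code, position)
            position += 4
            stack.append(value)
        else:
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right if opcode == ADD else left * right)
    return stack.pop()


program = compileExpression("2 + 3 * 4")
result = run(program)
```

Once program exists, the string it came from is no longer needed; run never looks at the source text, only at the bytes.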

Let's examine that last one, since it seems to be a sticking point for most folks.  A platform-native executable is simply a bunch of bytes in a file. There's nothing magic about it.  It's not even a particularly complex type of file.  It's a packed binary file, not a text file, but so are PNGs and JPEGs, and few programmers find it difficult to believe that such files might be created by Python.  The formats are standard and very long-lived and there are tons of tools to work with them.  If you're curious, even Wikipedia has a good reference for the formats used by each popular platform.

As to Python being slower than C: once a program has been transformed into executable code, it doesn't matter how slow the process for translating it was: the running program is now just executable instructions for your CPU, so it doesn't matter that Python is slower than C, because it was just the compiler that was in Python, and by the time your program is running, the original Python has effectively vanished and all you're left with is your program executing.

(Actually, Python is faster than C anyway, especially at producing strings.)

In reality, PyPy takes a hybrid approach, where it is a program which produces a program and then does some stuff to it and creates some C code which it compiles with the compiler of your choice and then creates some code which then creates other code and then puts it into memory, not a file, and then executes it directly, but all of that is ancillary tricks and techniques to make your code run faster, not a fundamental property of the kind of thing that PyPy is.  Plus, as I said, this article isn't actually about how PyPy works anyway, it's just about how you should pretend it works.  So you should ignore this whole paragraph.

For the sake of argument, assume that you know all the ins and outs of binary executable formats for different operating systems, and the machine code for various CPU architectures.  The question you should really ask yourself is: if you have to write a program (a compiler) which translates one kind of string (source code) into another kind of string (a compiled program): would you rather write it in C or Python?  What if the strings in question were a template document and an HTML page?

It shouldn't be surprising that PyPy is written in Python.  For the same reasons that you might use Django templates and not snprintf for generating your HTML, it's easier to use Python than C to generate compiled code.  This is why PyPy is at the forefront of so many advanced techniques that are too sophisticated to cover in a quick article like this.  Since the compiler is written in a higher-level language, it can do more advanced things, since lower-level concerns can be abstracted away, just as they are in your own applications.

Friday, January 20, 2012

The Concurrency Spectrum: from Callbacks to Coroutines to Craziness


Concurrent programming idioms are on a spectrum of complexity.

Obviously, writing code that isn't concurrent in any way is the easiest.  If you never introduce any concurrent tasks, you never have to debug any problems with things running in an unexpected order.  But, in today's connected world, concurrency of some sort is usually a requirement.  Each additional point where concurrency can happen introduces a bit of cognitive overhead, another place you need to think about what might happen, so as a codebase adds more of them it becomes more difficult to understand them all, and it becomes more challenging to understand subtle nuances of parallel execution.

So, at the simplest end of the spectrum, you have callback-based concurrency.  Every time you have to proceed to the next step of a concurrent operation, you have to create a new function and new scope, and pass it to the operation so that the appropriate function will be called when the operation completes.  This is very explicit and reasonably straightforward to debug and test, but it can be tedious and overly verbose, especially in Python where you have to think up a new function name and argument list for every step.  The extra lines for the function definition and return statement can be an impediment to quickly understanding the code's intentions, so what facilitates understanding of the concurrency model can inhibit understanding of the code's actual logical purpose, depending on how much concurrent stuff it has to do.  Twisted's Deferreds make this a bit easier than raw callback-passing without fundamentally changing the execution dynamic, so they're at this same level.
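As a rough illustration of that verbosity, here is a sketch in plain callback-passing style.  It does not use Twisted; the `pending` list is a hypothetical stand-in for an event loop, and `fetch_user` and `fetch_greeting` are made-up operations that "complete" later by calling whatever function they were given.

```python
# A hypothetical stand-in for an event loop: a queue of things to do later.
pending = []


def fetch_user(user_id, callback):
    # Pretend this takes a while; the result arrives via the callback.
    pending.append(lambda: callback({"id": user_id, "name": "alice"}))


def fetch_greeting(user, callback):
    pending.append(lambda: callback("hello, " + user["name"]))


results = []


# Every step of the logical flow needs its own named function.
def got_user(user):
    fetch_greeting(user, got_greeting)


def got_greeting(greeting):
    results.append(greeting)


fetch_user(1, got_user)

# Run the "event loop" until nothing is left to do.
while pending:
    pending.pop(0)()
```

Two steps of actual logic required two extra function definitions and two extra names, which is the tedium described above.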

Then you have explicit concurrency, where every possible switch-point has to be labeled somehow.  This is yield-based coroutines, or inlineCallbacks, in Twisted.  This is more compact than using callbacks, but also more limiting.  For example, you can only resume a generator once, whereas you can run a callback multiple times.  However, for a logical flow of sequential concurrent steps, it reads very naturally, and is shorter, as it collapses out the 'def' and 'return' lines, and you have to think of at least two fewer names per step.
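Here is a sketch of the same kind of two-step flow written with yield.  This is not Twisted's inlineCallbacks implementation; `drive` is a hypothetical, heavily simplified driver written only for this example (no error handling), and `pending`, `fetch_user` and `fetch_greeting` are made-up stand-ins for an event loop and its operations.

```python
# A hypothetical stand-in for an event loop: a queue of things to do later.
pending = []


def fetch_user(user_id):
    # Return a function that starts the operation when given a callback.
    def start(callback):
        pending.append(lambda: callback({"id": user_id, "name": "alice"}))
    return start


def fetch_greeting(user):
    def start(callback):
        pending.append(lambda: callback("hello, " + user["name"]))
    return start


def drive(generator, on_done):
    """Resume the generator each time the operation it yielded completes."""
    def step(value=None):
        try:
            start = generator.send(value)
        except StopIteration as stop:
            on_done(stop.value)
        else:
            start(step)
    step()


def greet(user_id):
    # Every possible switch-point is visibly labeled with 'yield'.
    user = yield fetch_user(user_id)
    greeting = yield fetch_greeting(user)
    return greeting


results = []
drive(greet(1), results.append)
while pending:
    pending.pop(0)()
```

The logic in `greet` reads top to bottom with no intermediate function names, and anything between two yields runs without interruption.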

However, that very ease can be misleading.  You might gloss over a 'result = yield ...' more easily than a 'def whatever(result): return result; something(whatever)'.  Nevertheless, if you have 'yield's everywhere you might swap your stack, then when you have a concurrency bug, you can look at any given arbitrary chunk of code and know that you don't need any locks in it, as long as you can't see any yield statements.  Where you do see yield statements, you know that you have some code that needs to be inspected.

To continue down that spectrum, a cooperatively multithreaded program with implicit context switches makes every line with any function call on it (or any line which might be a function call, like any operator which can be overridden by a special method) a possible, but not likely, culprit.  Now when you have a concurrency bug you have to audit absolutely every line of code you've got, although you still have a few clues which will help you narrow it down and rule out certain areas of the code.  For example, you can guess that it would be pathological for 'x = []; ...; x.append(y)' to context switch. (Although, given arbitrary introspection craziness, it is still possible, depending on what "..." is.)  This is way more lines than you have to consider with yield, although with some discipline it can be kept manageable.  However, experience has taught me that "with some discipline" is a code phrase for "almost never, on real-life programming projects".

All the way at the end of the spectrum of course you have preemptive multithreading, where every line of code is a mind-destroying death-trap hiding every possible concurrency peril you could imagine, and anything could happen at any time.  When you encounter a concurrency bug you have to give up and just try to drink your sorrows away.  Or just change random stuff in your 'settings.py' until it starts working, or something.  I never really did get comfortable in that style.  With some discipline, you can manage this problem by never manipulating shared state, and only transferring data via safe queueing mechanisms, but... there's that phrase again.
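For what it's worth, the disciplined version looks something like the following sketch, using only the standard library's `threading` and `queue` modules.  The worker and its squaring "job" are invented for illustration; the point is just that the two threads share nothing except the queues.

```python
import queue
import threading


def worker(inbox, outbox):
    # The worker touches no shared state; it only talks through the queues.
    while True:
        item = inbox.get()
        if item is None:  # sentinel value: time to stop
            break
        outbox.put(item * item)


inbox, outbox = queue.Queue(), queue.Queue()
thread = threading.Thread(target=worker, args=(inbox, outbox))
thread.start()

for n in range(5):
    inbox.put(n)
inbox.put(None)
thread.join()

squares = [outbox.get() for _ in range(5)]
```

Nothing in the language stops the worker from reaching out and mutating some global instead, which is exactly why this takes discipline.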

Some programming languages, like Erlang, support efficient preemptive processes with state isolation and built-in super-cheap super-fast queues to transfer immutable values.  (Some other languages call these "threads" anyway, even though I would agree with Erlang's classification as "processes".)  That's a different programming model entirely though, with its own advantages and challenges, which doesn't land neatly on this spectrum; if I'm talking about left and right here, Erlang and friends are somewhere above or below.  I'm just describing Python and its ilk, where threads give you a big pile of shared, mutable state, and you are constantly tempted to splash said state all over your program.

Personally I like Twisted's style best; the thing that you yield is itself an object whose state can be inspected, and you can write callback-based or yield-based code as each specific context merits.  My opinion on this has shifted over time, but currently I find that it's best to have a core which is written in the super-explicit callback-based approach with no coroutines at all, and then high-level application logic which wraps that core using yield-based coroutines (@inlineCallbacks, for Twisted fans).

I hope that in a future post, I may explain why, but that would take more words than I've got in me tonight.

Thursday, December 15, 2011

I'm Sorry It's Come To This

If you want to be a great leader,
you must learn to follow the Tao.
Stop trying to control.
Let go of fixed plans and concepts,
and the world will govern itself.


I usually try not to get too political in my public persona – on blogs, twitter, IRC, mailing lists et cetera – and that's a conscious choice.

I work on open source software.  I have for the last ten years.  I am lucky enough to have founded a project of my own, but in open source, leaders are more beholden to their followers than vice versa.  I depend on people showing up to effectively work for me, for free, on a regular basis.  So, I try to avoid politics not because I don't have strong convictions (anyone who knows me personally can tell you that I certainly do) but because I don't want someone to avoid showing up and helping do some good in the world in one area, just because we might disagree in another.

This is a benefit of living in a free and democratic society: we have ways to dispute issues that we have strong feelings about, so we can cooperate on some things without having to agree on everything.  It's rarely perfect but we can usually get some good stuff done, with rough consensus and running code.

Today though, there's a political issue which I can't ignore.  The purpose of Twisted (the open source project which I founded) is to facilitate the transfer of information across the Internet.  A new law, SOPA, is threatening to radically alter the legal infrastructure of the Internet in the United States, granting sweeping new powers to copyright cartels and fundamentally restricting the legal right to transfer any information, and to build tools that transfer it.  Twisted is designed to make it easy to implement new protocols, to easily experiment with improvements to systems like the Domain Name System.  SOPA might well make those potential improvements, and with only a little paranoid fantasizing, Twisted itself, illegal.

It's my view that this law is a blatantly unconstitutional restriction on free speech.  It will kill job creation, at a time when our nation can scarce afford another blow to its economy.  It will create the infrastructure to suppress political dissent, similar to the infrastructure in China and Syria, at a time when our corrupt political system needs dissent more than ever.  It is the wrong thing at the wrong time.

This bill is being discussed in the House today.  If you're in the US, call your representative right now.

(As always, I don't speak for anyone but myself; no one else has reviewed or endorsed these remarks.)

Friday, November 04, 2011

Blocking vs. Running

I've heard tell of some confusion lately around what the term "non-blocking" means.  This isn't the first time I've tried to explain it, and it certainly won't be the last, but blogging is easier than the job Sisyphus got, so I can't complain.

A thread is blocking when it is performing an input or output operation that may take an unknown amount of time.  Crucially, a blocking thread is doing no useful work.  It is stuck, consuming resources - in particular, its thread stack, and its process table entry.  It is sucking up resources and getting nothing done.  These are resources that one can most definitely run out of, and are in fact artificially limited on most operating systems, because if one has too many of them, the system bogs down and becomes unusable.

A thread may also be "stuck" doing some computationally intensive work; performing a complex computation, and sucking up CPU cycles.  There is a very important distinction here, though.  If that thread is burning up CPU, it is getting work done.  It is computing.  This is why we have computers: to compute things.

It is of course possible for a program to have a bug where it goes into an infinite loop, or otherwise performs work on the CPU without actually getting anything useful to the user done, but if that's happening then the program is just buggy, or inefficient.  But such a program is not blocking: it might be "thrashing" or "stuck" or "broken", but "blocking" means something more specific: that the program is sitting around, doing nothing, while it is waiting for some other thing to get work done, and not doing any of its own.

A program written in an event-driven style may be busy as long as it needs to be, but that does not mean it is blocking.  Hence, event-driven and non-blocking are synonyms.

Furthermore, non-blocking doesn't necessarily mean single-process.  Twisted is non-blocking, for example, but it has a sophisticated facility for starting, controlling and stopping other processes.  Information about changes to those processes is represented as plain old events, making it reasonably easy to fold the results of computation in another process back into the main one.

If you need to perform a lengthy computation in an event-driven program, that does not mean you need to stop the world in order to do it.  It doesn't mean that you need to give up on the relatively simple execution model of an event loop for a mess of threads, either.  Just ask another process to do the work, and handle the result of that work as just another event.
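Here is a minimal sketch of that idea.  To keep it self-contained it uses Python's standard-library asyncio rather than Twisted, so treat it as an analogy and not as Twisted's API; the "lengthy computation" (`CHILD_CODE`) and the `stay_responsive` task are made up for the example.

```python
import asyncio
import sys

# Hypothetical "lengthy computation", run by a separate Python process.
CHILD_CODE = "print(sum(i * i for i in range(1000)))"


async def compute_in_child():
    # Ask another process to do the work...
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", CHILD_CODE,
        stdout=asyncio.subprocess.PIPE,
    )
    # ...and treat its output arriving as just another event.
    out, _ = await proc.communicate()
    return int(out.decode().strip())


async def stay_responsive(ticks):
    # Meanwhile, the event loop is free to handle other things.
    for n in range(3):
        ticks.append(n)
        await asyncio.sleep(0)


async def main():
    ticks = []
    result, _ = await asyncio.gather(compute_in_child(), stay_responsive(ticks))
    return result, ticks


result, ticks = asyncio.run(main())
```

The main process never burns CPU on the sum and never blocks waiting for it; it simply gets told when the child is done.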

Saturday, September 10, 2011

2L2T: DjangoCon Feedback

I've been having a great time over here at DjangoCon, but now that I've had an opportunity to relax and process some feedback from my talk, I have noticed a couple of themes to that feedback.  This isn't really a full article, just a response, but it's too long to tweet.  If you're curious about the talk, I believe it will be showing up on blip.tv under http://blip.tv/djangocon somewhere next week.  (I'll try to remember to update this post when it's available.)
For the most part, the talk was exceedingly well-received and I want to thank the Django community both for the opportunity to speak and for the overwhelmingly positive response.  Thanks for making an outsider to your community feel welcome and appreciated.
There have been a couple misconceptions though, and perhaps I didn't express myself clearly on a few points.
  1. I realize that there are times – plenty of times, even – when using some component that's in a different language from your main application is the right choice.  I wasn't trying to say "all Python all the time no matter what, no exceptions".  I just want you all to consider that there is a cost to using a component that's in a different language, and you should be aware of that cost.  It's not as simple as a tick-a-box feature comparison of the features and drawbacks of multiple products.  If I came across as sounding really extreme on this, it was just to provoke a response.
  2. You can have an architecture which is driven by Python and organized by Python without actually having all the implementation be in Python.  For example, an inordinate number of people asked me about memcache.  If you want to use something like that, sure, use memcache, there's not a lot that it being in Python would buy you.  Some might say that the whole point of memcache is that it isn't very deeply configurable and doesn't have much in the way of behavior.  Plus, it's an internal component, not an externally visible service, so even my usual flimsy "no buffer overflows" argument doesn't really hold up; it's more like a library than a server.  You can incorporate memcache into a Python-in-the-driver's-seat architecture by spawning memcache from your Python process instead of making memcache a configuration dependency.  That way, you don't need a separate configuration file and a separately managed service or a chef script that boots memcache for you before your application.  This applies equally well to any other, similar services: write their config files from your Python code, and start them automatically.
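A rough sketch of what "spawning memcache from your Python process" might look like, using only the standard library.  The function names and default values here are my own invention for illustration; the `-p` (port), `-m` (memory in megabytes) and `-l` (listen address) flags are memcached's usual command-line options, but check them against the version you actually run.

```python
import shutil
import subprocess


def memcached_command(port=11211, memory_mb=64, executable="memcached"):
    """Build the command line; the configuration lives in Python, not a file."""
    return [executable, "-p", str(port), "-m", str(memory_mb), "-l", "127.0.0.1"]


def spawn_memcached(**options):
    """Start memcached as a child of this process (hypothetical helper)."""
    command = memcached_command(**options)
    if shutil.which(command[0]) is None:
        raise RuntimeError("memcached does not appear to be installed")
    return subprocess.Popen(command)
```

Your application would call `spawn_memcached()` at startup and terminate the returned process at shutdown, so the cache's lifetime and configuration are owned by the Python code.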
Finally, thanks to everyone who really thought about what I said, took the time to respond, and prompted me to write this.

Update: The video of my talk is now available on blip.tv.

Sunday, June 19, 2011

ἁγιολογία for r0ml

I have the rare distinction of being a second-generation software developer.  Most recently, I mentioned this in an interview when asked who my programming heroes are.  It might sound kind of corny, but I'm serious when I say that my father is my programming hero.
Robert "r0ml" Lefkowitz
My dad had a cool hacker alias in the seventies.  He's been known as "r0ml" around the web since before there was a web. If you are in a particularly typographically hip part of the internet, it might even be "RØML".  How many of your parents have a nom de plume with a digit, or a non-ASCII character in it?  Or, for that matter, any kind of hacker pseudonym?
I had the good fortune to work with one of r0ml's colleagues, Amir Bakhtiar.  Amir paid me one of the highest compliments I've ever received: he said that the code for systems I've worked on is similar to r0ml's in its style and exposition.  My dad taught me how to program in x86 assembler, and in that process, I learned a lot about the way he thought about solving problems and building systems.  I regard thinking that well, or even comparably well, as a real achievement.
That's not to say that I would do everything exactly the way that he does.  For example, he writes a lot of networking code in Java.  He doesn't use Twisted, for the most part.  If you know me and you know my dad, you know that we disagree on plenty of stuff.
Unlike the stereotypical, often-satirized filial argument, these discussions are something I look forward to.  Disagreeing with my dad is still one of the most intellectually challenging activities I've ever engaged in.  Whenever I have a conversation with him about a topic where he has a different view, I come away enlightened – if not necessarily convinced.
Conversations among my friends occasionally turn to the topic of our respective upbringings, as they do in any close group.  One of the recurring themes of my childhood is that, while my siblings and I were sometimes told to be quiet, we were never told to be quiet because our opinions weren't valuable.  Sometimes we were told in unequivocal terms that we were wrong, of course.  However, my dad always encouraged us to present our thoughts.  Then, he wouldn't pull any punches in relentlessly refuting our arguments, using a combination of facts, estimates, calculations, and rhetorical flourishes.  I learned more about influencing people and thinking clearly around the dinner table than in my entire formal education.
r0ml always questions glib answers, challenges the official version of events, distrusts things that are "intuitively obvious" or "common sense".  The skepticism I've developed as a result of his consistent example has rarely led me astray.  Glib answers, official versions, and common sense are frequently, if not always, wrong.  He taught me to search for the non-intuitive answer, the surprising inflection point in the data.
In a roundabout way, he also taught my siblings and me how to perform some delightful rhetorical flourishes of our own, but also not to trust them.  Pretty phrases can be deployed equally effectively in the service of illustration or deception.  Although I can appreciate that parents often come to a point where they've had enough and a little deception can be a useful thing.
One cannot be a practiced rhetorician without a heaping helping of eclectic life experience; r0ml has that too.  He's a fencer.  And a juggler.  He still has the highest score on Space Harrier of anyone I've ever met.  (I can remember a crowd gathering in an arcade to see him start level 18.)  He's an avid scholar of medieval thought and custom.  For that matter, he's an avid scholar of a couple dozen other things, but listing them all would take a whole day.
He has the common occupational affliction of being a science fiction fan.  However, fandom was never an identity for him. Again, by consistent example, he taught me to focus on my own creativity, and do something cool, never to just passively consume others' ideas.  He treats entertainment as an inspiration, rather than an escape.  For instance, one of the earliest memories I have about my father talking about software is a reference to the movie "Terminator".  (Please keep in mind that this memory is ~20 years old at this point, so it might not be terribly accurate.)  I remember him saying something like "All software should be relentless.  If you remove its legs, it should use its arms.  Whatever errors it encounters, it should deal with them, and keep going if it can."
Nevertheless, seeing "Tron: Legacy" with my dad, the hacker, in IMAX 3D, 20 years after we saw the original together... I didn't need to take a life lesson from that to think it was pretty rad[1].
Unlike many quiet geniuses who labor in obscurity, dispensing wisdom only to a fortunate few, r0ml is a somewhat notorious public speaker.  You can see him this year at OSCON.  If you hunt around the web, you can find some video examples of his previous talks, like this great 30-second interview[2] about the nature of open source process, from a talk he gave in 2008 (audio of the full talk here).
([1]: Although, jeez, what was the point of that whole open-source subplot at the beginning?  It seemed like a great idea, but then it went absolutely nowhere!)
([2]: Speaking of not doing things exactly the way he does - where he uses a metaphor of "single-threading" and "multi-threading", I would have said "blocking" and "event-driven" - but more on that in a future post.)
Happy Father's Day, r0ml.

Saturday, April 02, 2011

Calling all Ascetic Buddhist Rock Musicians

The Presentation

The inimitable Zooko recently made me aware of an excellent presentation about HTTPS: "It's Time to Fix HTTPS", by Chris Palmer.

The presentation is both hilarious and illuminating; I highly recommend you view it right away.  It's not saying anything that I haven't been thinking for a very long time.  Except the thing about how IE can silently add certificates to your root CA store, that was definitely new, and a little depressing.  But this is a somewhat esoteric topic and it needs to be made more popular for the everyday user.  Sexy, even.

A Brief Review

(But seriously, go read the slides, they're more entertaining.)

Internet security is based on trust.  The math behind modern cryptography doesn't ensure anything beyond that you're talking to someone that holds a particular special secret ("private key").  You can verify that the party you're talking to has the same key as the one you talked to last time, and that a particular private key corresponds to a particular public key, but that's about it.  The public key can be published for everyone to see without risking any of the secrets being sent, but you still need some way to determine whether the public key actually belongs to the person you want to talk to.  So, in order to have a secure system, you have to layer some rules on top of that which give you some way to know whether that private key corresponds to an identity that you care about and trust.

The current system goes something like this: each web browser vendor decides, more or less at random, on a group of entities we will all trust completely.  By virtue of the trust of the software, they become the authorities who can decide whose public keys are valid.  Actually, a public key isn't quite enough: you need a key plus some metadata about the person sending it: we call this a "certificate".  So these entities are termed "certificate authorities".  The browser vendors tend to decide on the same group, because there's a lot of social pressure to maintain a list that makes sense (and also, anybody who gets accepted by one browser but denied by another can't really sell certificates: the whole point of this exercise is to sell things that make the little lock icon come up, so you know your web shopping cart is "secure").

The problem with this system is that almost all of these "completely trustworthy" entities are enormous companies or, possibly, even foreign governments, which have diverse motivations and huge amounts of legitimate business to conduct, making it very hard to spot a small amount of malfeasance.  (Although there is some good news: people do notice, and they freak the hell out when they do; so at least there's some policing of the current system.)  One compromised certificate authority (and there are lots and lots to try and compromise) means a complete "game over" for everybody who uses a web browser and trusts the little lock icon.

Basically there's no such thing as "completely trustworthy".  There's only: do I trust you.

The Next Step

The solution that Mr. Palmer proposes is extremely similar to the one which I thought I originally devised in about 2004, but probably was floating around in the security zeitgeist even before that.  It's a combination of 3 general principles:

Trust On First Use

Basically, the first time I see you, on the internet, it's unlikely that you're trying to trick me.  So you can give me any old public key, and I'll accept that it's you.

Mr. Palmer gives this one a catchy pseudonym, "TOFU", which I quite like (and I guess is pretty widely known at this point).

Persistence Of Pseudonym

The important point is that then I remember that it's you, forever, so it's very hard to attack our communications after that point.

I'll come up with a name for you (let's say "Bob Smith" or "The Most Secure Bank In The World Dot Com"), and my software will make sure that it sticks to that public key.  You can potentially tell me that your key has changed, but you'd better be prepared to present your old key, otherwise I have to get re-introduced to you, and now I'm suspicious that something may have been fishy.  Especially if some other thing shows up and says "Hi, it's Bob Smith" (with the correct, old public key) - "Hey, who's this guy?"

This is referred to as "POP".  Also pretty catchy.
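These first two principles fit in a few lines of code.  The following is only a sketch under my own assumptions: the `PinStore` class and its return values are hypothetical, keys are treated as opaque bytes, and a real implementation would need to persist the pins and handle legitimate key changes.

```python
import hashlib


class PinStore:
    """Remember the first key seen for each name (hypothetical sketch)."""

    def __init__(self):
        self._pins = {}

    def check(self, name, public_key):
        fingerprint = hashlib.sha256(public_key).hexdigest()
        pinned = self._pins.get(name)
        if pinned is None:
            # Trust On First Use: accept whatever key shows up first.
            self._pins[name] = fingerprint
            return "first-use"
        if pinned == fingerprint:
            # Persistence Of Pseudonym: same key as last time, same party.
            return "match"
        # A different key under a known name: time to get suspicious.
        return "mismatch"


store = PinStore()
first = store.check("Bob Smith", b"bob's public key")
again = store.check("Bob Smith", b"bob's public key")
impostor = store.check("Bob Smith", b"some other key")
```

Here `first` is "first-use", `again` is "match", and `impostor` is "mismatch".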

Mesh Overlay Network Keysigning

The third concept Mr. Palmer refers to as a "trustiness metric" which includes "perspectives", and says "You can't fool all of the people all of the time".  He includes some other stuff in his trustiness metric here, but I'm going to extrapolate from that sentence:

It's really, really easy to sit down in a café and intercept some of my network traffic.  It takes about 2 minutes to collect a dozen passwords this way, on today's mostly-not-encrypted internet.  So it would be very easy for someone to break this system if all you had was a little re-introduction warning; users might not understand it and just click anyway, and then it's just as broken as (if not worse than) the current model; at least in the current model, normal users don't usually get those warnings, and they're "safe" if they're looking for the lock, but in this new model, users would get them for all new secure introductions.  So we need something better.

It's not so easy to sit down in a café and intercept network traffic from me and also intercept traffic from my friend, on a different network, doing a different thing.  You have to know where my friend is.  You have to be able to intercept our pre-arranged secure communication (I already remember all my friends' keys when I first see them, you'll recall).  If you're a casual attacker who just wants to sniff a couple of credit card numbers at the local Starbucks, you probably don't have the resources to do that, even for a single individual.

It is definitely not easy to figure out where every single one of my currently-online friends - let's say Facebook friends, because maybe they finally care about security now - is online from, and also attack their networks simultaneously, to provide exactly the same bogus first-introduction certificate to Super Secure Bank Dot Com.  This is a level of sophistication and coordination that not even most governments can muster.

So if we had a reasonably available mesh overlay network, where I can tell my friends, and my friends can tell their friends (etc forever) about first-introduction key correspondence with DNS names, and legitimate changes to keys where the site operator has had a security problem, then we could address many of these issues much more robustly than we can today.  It might not be perfect, but it would silently work often enough that it would be much better than today's default of "bah, I don't know why you're getting the browser warning; just use HTTP".
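The core check such a network enables is simple to sketch.  Everything below is an assumption for illustration: the function name, the idea of comparing key fingerprints as strings, and especially the 75% threshold are all made up, and none of it is taken from Mr. Palmer's slides.

```python
def peers_agree(my_fingerprint, peer_fingerprints, threshold=0.75):
    """Return True if enough friends saw the same key for this site as I did.

    'threshold' is a hypothetical tuning knob, not a recommended value.
    """
    if not peer_fingerprints:
        # Nobody to ask: fall back to whatever the re-introduction rules say.
        return False
    agreeing = sum(1 for seen in peer_fingerprints if seen == my_fingerprint)
    return agreeing / len(peer_fingerprints) >= threshold


# Three of four friends saw the same key I did; one was (maybe) attacked.
mostly_consistent = peers_agree("aa11", ["aa11", "aa11", "aa11", "bb22"])

# Most friends saw a different key than I did: I am probably the victim.
probably_attacked = peers_agree("aa11", ["bb22", "bb22", "aa11"])
```

An attacker in one café can change what I see, but to make this check pass they would also have to change what most of my friends see.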

Badump Ching

If you've been paying attention I think you can see where I am going with this.

We (those of us in the open source hipster security noosphere) need to popularize this concept, because it's not that hard to implement and people keep re-inventing it everywhere; it's mostly just about getting some browser vendor to think it's a good idea.

The acronym is TOFU POP MONK, so clearly we need a vegetarian monk - Buddhist seems most likely - who sings pop songs about how great tofu is.  We need it to go viral on the you tubes, and any other tubes that are appropriate.

(Graphic design nerds, and sports racers of all stripes, start your engines.  I challenge you.  Show me some awesome macroable meme images starring the Tofu-Pop Monk.  I will post any particularly compelling ones here.)

Saturday, December 04, 2010

Resolving diverged Bazaar branches on the go with 'dead heads'.

If you're like me, occasionally you grab the latest version of a bzr branch onto your laptop before you're going somewhere without network access. But, as you're about to leave, you glance over at your laptop screen, and you see the dreaded:
bzr: ERROR: These branches have diverged. Use the missing command to see how.
Use the merge command to reconcile them.
but you don't have time to do a merge, and wait for the (reliably agonizingly slow) network round trip to negotiate with the server about what the latest revision is - the train's about to leave, or you're late for your flight, or the cafe is closing and you need to shut your laptop right now.  Sadness!  You continue to work on a diverged branch and merge later.  Which is a shame, because mechanically dealing with merge conflicts or just making sure the tests still pass after what looks like a trivial merge is exactly the sort of thing which is convenient to do when you're stuck waiting at a network-access-free bus stop.
As it turns out, Bazaar has actually already done all the hard work necessary for you to just go ahead and do that merge when you get to your potentially non-networked destination.  The diverged revisions have already been pulled into your branch and are just sitting there, waiting to be merged, but you can't see them.  The 'bzrtools' plugin provides the 'heads' command, which you can use to reveal the previously invisible revision.  You can then just 'merge .' instead of merging from your usual pull location, as long as you specify the appropriate revision.
To demonstrate, here's a transcript of a sample session which simulates this common problem:
First, set up a branch:
you@computer:~$ mkdir tmp
you@computer:~$ cd tmp
you@computer:~/tmp$ mkdir a
you@computer:~/tmp$ cd a
you@computer:~/tmp/a$ bzr init
Created a standalone tree (format: 2a)
you@computer:~/tmp/a$ touch initial.txt
you@computer:~/tmp/a$ bzr add
adding initial.txt
you@computer:~/tmp/a$ bzr ci -m "inital revision"
Committing to: /Domicile/glyph/tmp/a/
added initial.txt
Committed revision 1.
We'll call 'a' the 'server' branch. Next, let's make a branch that represents the 'on the go' branch, your local working copy:
you@computer:~/tmp/a$ cd ..
you@computer:~/tmp$ bzr get a b
Branched 1 revision(s).
Now, it's time to diverge. Let's give each branch its own revision.
you@computer:~/tmp$ cd a
you@computer:~/tmp/a$ touch a.txt
you@computer:~/tmp/a$ bzr add
adding a.txt
you@computer:~/tmp/a$ bzr ci -m 'revision from a'
Committing to: /Domicile/glyph/tmp/a/
added a.txt
Committed revision 2.
you@computer:~/tmp/a$ cd ../b/
you@computer:~/tmp/b$ touch b.txt
you@computer:~/tmp/b$ bzr add
adding b.txt
you@computer:~/tmp/b$ bzr ci -m 'revision from b'
Committing to: /Domicile/glyph/tmp/b/
added b.txt
Committed revision 2.
Now, it's time to get on that sad, wifi-free train. Let's make sure we're up to date with 'a' first...
you@computer:~/tmp/b$ bzr pull ../a
bzr: ERROR: These branches have diverged. Use the missing command to see how.
Use the merge command to reconcile them.
[Error: 3]
Oh no! But, here comes 'bzr heads' to the rescue:
you@computer:~/tmp/b$ bzr heads --dead
HEAD: revision-id: you@computer-123456 (dead)
committer: You <you@computer>
branch nick: a
timestamp: now-ish
message:
revision from a
Now you know what the revision ID of the already-pulled-but-not-visible revision is - the tip of 'a', in other words. Now you just need to ask 'b' to merge it:
you@computer:~/tmp/b$ bzr merge . -r you@computer-123456
+N  a.txt
All changes applied successfully.
you@computer:~/tmp/b$ bzr ci -m 'merge from a'
Committing to: /Domicile/glyph/tmp/b/
added a.txt
Committed revision 3.
Done! And when you get back to your cozy 10gigE fiber connection at home, or whatever you happen to have, you can see that the revision you've merged lines up neatly with 'a':
you@computer:~/tmp/b$ bzr pull ../a
No revisions to pull.
you@computer:~/tmp/b$
Et voila. I hope this saves somebody some time when dealing with failed pulls.
For those of you who may be curious about the use-case, if you don't have it: I rarely encounter this with actual codebases I work on, as I tend to have a local trunk mirror, and features are neatly segregated into branches. It comes up more frequently in my personal configuration-files repository, where I make little changes to my desktop, little changes to my laptop, and then want to get out the door quickly with the latest merged copy. I was so happy when #bzr on freenode (thanks, spiv!) solved this problem for me that I just had to share.

Wednesday, January 06, 2010

Some Common Onomatological Errors

The open-source event-driven networking engine that I work on is called "Twisted".  If you're uncomfortable using something that sounds like an adjective in a place where a noun should go, the following noun phrases are equivalent:
  1. the Twisted project
  2. the Twisted engine
  3. the Twisted networking engine
  4. the Twisted framework
The unofficial group (of which I am a member) which works on that software is known as "Twisted Matrix Laboratories", sometimes shortened to "Twisted Matrix Labs" or "TMLabs".

I can understand that there is some confusion around this stuff, since these words often appear in close proximity, but to my knowledge there is nothing called "Python Twisted", "Twisted Python", or "Twisted Matrix".  There's "python-twisted", which is the package name that some operating systems use to package Twisted.  There is also "twisted.python", which is a Python package within Twisted itself.  Finally there is "twisted-python@twistedmatrix.com", which is the mailing list for discussing Twisted stuff in the Python programming language.  (This discussion list is so named to distinguish it from the possibility of not-quite-hypothetical discussion of Twisted implemented in other languages, although no other implementations are currently actively maintained.)

I just thought you'd all like to know that.  That is all.  (For now, anyway.)

Saturday, October 24, 2009

Learn Twisted

Jean-Paul Calderone continues his excellent "Twisted Web In 60 Seconds" tutorial series.  If you haven't checked it out yet, you should!

Do you want WiFi to work at your conference?

I've been pretty busy for the last couple of weeks, so I've just had an opportunity to catch up with blog posts that have been piling up.  In particular I noticed this one: The “WiFi At Conferences” Problem, by Joel Spolsky.

Joel has a lot of what look like good recommendations.  However, I can provide a much-abridged list.

Some years, WiFi access at PyCon US has been provided by the venue, or by a contractor whose name I mercifully do not know.  Those years, it has not worked.  Some years, it has been provided, or at least managed, by tummy.com.  Those years, it has worked.  They are probably much more critical of their own efforts than I am, as you can see in this thorough write-up that they did of PyCon's 2008 WiFi situation.

My two-step plan for you, if you want your conference to have working WiFi access, is:
  1. e-mail somebody at tummy.com, telling them that you want a working wireless network, and
  2. give them whatever they ask for.
If you do these things, then when people open their laptops at your conference, their networks will work.

Sunday, October 04, 2009

Hobgoblin History

I like Terry Jones; I think FluidDB has a lot of potential.  But, sometimes when he's talking about it, he gets a little carried away and forgets that the rest of us don't live in his future yet.  In his latest missive on the official FluidDB blog, "Digital Hobgoblins", he describes some of the problems that FluidDB sets out to solve.

The problem is, I already have solutions for all of these problems, and I don't quite understand why they don't (or shouldn't) work for me.  (Since he organizes the post in terms of problems that existing systems have, I'm going to take the liberty of re-labeling these in terms of the problems that he seems to be describing rather than the lead text he used.  Please post a comment if you think my labeling is wrong.)

In existing systems, Terry says:

"Things must be named, and have one name."  Specifically, Terry calls out file systems.  Except... file systems have lots of ways of introducing multiple names for the same thing.  Symbolic links.  Hard links, if you really want to allow for ambiguity.  If you want to track that ambiguity, Windows "shortcuts" and MacOS "aliases" can do that.  Overlay mounts, loopback mounts and chroot execution allow for semi-arbitrary renaming.  Lots of other systems support this, too.  Database systems have a specific provision for multiple names: the many-to-one relation.  Any programming language with pass-by-reference data structures allows for some level of multiple-naming.  In fact, there's a whole discipline for allowing things to have lots of different names: indexing.  Anywhere you have a full-text index or an object where multiple attributes are indexed in some kind of database, you've got objects with more than one name.

"You have to be consistent and unambiguous."  As I mentioned on the first point, there are lots of ways to be slightly ambiguous at a human level.  You can refer to the same thing by different names, or, with mutable binding, you can refer to different things with the same name.  In some circumstances, you must be precise, but that's because fundamentally, algorithmic thinking requires a certain level of precision, not because of any specific problem with computers.  In fact, there is a word for inconsistency and ambiguity in programming languages: polymorphism.  Any time you invoke an interface rather than a concrete implementation (which is to say any time you do anything in a dynamic language like Python) you are being ambiguous and potentially inconsistent in your program's behavior.

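Here is a minimal sketch of that kind of ambiguity in Python.  The class and function names are made up purely for illustration; the point is only that the one name "speak" refers to different behavior depending on what happens to be bound at the time.

```python
class Duck:
    def speak(self):
        return "Quack"


class Robot:
    def speak(self):
        return "Beep"


def greet(thing):
    # "thing.speak" is deliberately ambiguous: it names whatever
    # implementation the object passed in happens to provide.
    return thing.speak()


# The same call site, two different concrete behaviors.
results = [greet(Duck()), greet(Robot())]
print(results)
```

The program is still perfectly precise about what it does; it just defers the choice of implementation until the call happens.
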
"You only get one way to organize stuff."  This is a pretty weak point, though, given that Terry himself immediately turns around and notices that tagging and other multiply-indexed database systems are becoming popular.  So he gives us two examples of exceptions, but no examples of the rule.  I'm not sure what I could add to that.

"Programmers are obsessed with "meaning"."  On this one, I'm going to agree, except I don't think it's a problem.  In the computational world, we are obsessed with the meaning of data, because if you get the meaning of the inputs wrong, then the meaning of the outputs is wrong too.  For example: if you have a number that represents the total liabilities that your company has accumulated, it's pretty important that you don't ever treat that as your total profit.  At a deeper level, if you have a sequence of bits that represents a floating-point number, it's important to know about its intended meaning, and not treat it as a string of characters, unless what you really want is a string.  "@H=N" is not as useful a concept as "3.1287417411804199" if you are trying to add it to something.  For what it's worth, I have my own, similar take on how we should treat computational objects that have multiple meanings: Imaginary. Even systems like Imaginary and FluidDB depend on a very rigid definition of some simpler concepts, like numbers consistently being numbers and words consistently being words.  In my view, even if we treat the book itself as multifaceted, it's important to know what the data representing the "readable object" part of a book is really "about", and make sure it stays distinct from the data representing the "paperweight" part of the book.  To be fair, FluidDB appears to do this itself — and this terminology is my least-favorite part of FluidDB — by having single-purpose, permission-controlled "objects" just like every other system, but calling them "tags", and re-using the word "objects" to refer instead to what others might call a "UUID" or "central index".  In Imaginary, the system is similar; although the centrality of the FluidDB "object" (in Imaginary's case, the "Thing") is less stark; using FluidDB's terminology, in Imaginary, a "tag" can have a "tag" of its own; in fact, there's nothing but tags ("Items") anywhere.

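To make the floating-point example concrete, here is a small sketch using Python's standard `struct` module.  It assumes the four bytes are a big-endian IEEE 754 single-precision float, which is my reading of the example rather than something the text spells out.

```python
import struct

raw = b"@H=N"  # four bytes, with no meaning attached yet

# Interpretation 1: a string of ASCII characters.
as_text = raw.decode("ascii")

# Interpretation 2: a big-endian 32-bit float (an assumption about the encoding).
(as_float,) = struct.unpack(">f", raw)

# Only one of these interpretations is useful if you want to do arithmetic.
total = as_float + 1.0
try:
    as_text + 1.0
    text_is_addable = True
except TypeError:
    text_is_addable = False

print(repr(as_text), as_float, total, text_is_addable)
```

The bytes are identical in both cases; only the intended meaning differs, and that meaning is what decides whether adding something to them makes any sense.
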
"Metadata is separated from the data it describes."  This may be true in some systems, but the web is probably the system with the most data in it anywhere, and in that system, metadata is always available as part of the request and the response.  You can put in any headers you want in the response, and there are lots of pieces of metadata (like content-type) which are almost always found along with the data.  In my opinion, the problem is more that we don't have enough of the previous problem.  Web developers haven't been obsessed enough with meaning: there aren't enough useful conventions around the HTTP request/response metadata, and so it's hard to bundle more metadata in with your response and have it faithfully propagated elsewhere.  We don't know what arbitrary headers might mean, because we don't have any way of expressing a schema for them.

Terry says he's going to write more about these problems, and the solutions that FluidDB provides for them.  I'm looking forward to it.  As part of that, I'd really like to see a clear description of how these problems affect me, or someone I know, either as a programmer or as a user.  What do I, or should I, really want to do with some application right now that these five problems are preventing me from doing?

The reason I felt compelled to write about this is that history — and particularly the history of websites like freshmeat and sourceforge — is littered with the corpses of projects which promised to fundamentally change the way we represent data.  A common problem with these projects is that they have expansive denunciations of current techniques to represent data, or manage persistence, and claim to provide an advance so significant that they will displace all current applications.  What most of the people working on these projects don't realize is that the current techniques for representing data have a history, and there are good reasons for their limitations.  Granted, not all of those reasons are currently relevant, and many are examples of path dependence, but it's still important to understand the reasons in order to escape the problems.

In FluidDB's case, I think that the problem isn't so much that Terry doesn't have the historical perspective, but that he assumes that we all do.  And that we can all make the cognitive leap to see why FluidDB is necessary.  But if I can't do it, I have to assume there are at least a few other programmers who aren't getting the message either.

Thursday, September 24, 2009

Diesel: A Case Study In That Thing I Just Said

Thanks to jamwt for the shout-out on the announcement of Diesel.

Since the reaction to my reaction to tornado was so good (or at least so ... energetic), I figure I should comment on Diesel as well.  Spoiler alert: my reaction is ... largely similar, but since jamwt has been kind of nice to Twisted in the past, and didn't actually say anything mean this time, I'm somewhat reluctant to have that reaction.  Nevertheless, I swore a solemn oath to tell it like it is, keep it real, and so forth.  So I must.

Once again, I'm happy that event-driven programming is getting some love.  This time, I'm pleased that nobody is saying anything especially snarky or FUD-ish about Twisted.  I do feel like it's a little weird not to mention Twisted, or include some comparisons to Nevow or Orbited, both of which provide different, comprehensive approaches to COMET with Twisted.

(Worth noting: Orbited also originally started out using its own event-driven I/O layer, but switched to Twisted later, because Twisted is "crazy delicious".)

Diesel has many more interesting ideas at the level of async I/O than Tornado did.  I think the generator-based approach for implementing protocols is interesting and deserves some more exploration.  I'm not sold on it for every use-case, and I think the implementation might have some flaws, but it definitely has some advantages.

I'd give jamwt a hard time for not reporting issues and communicating with Twisted more before re-writing the core, but for three issues:
  1. jamwt's been around in the Twisted community for a while.  He's written a bunch of fairly deep Twisted code and he clearly knows what the framework is capable of.
  2. I've spoken with him on a number of occasions, and for all I know I might have discussed this with him.  I don't remember it, but it would be pretty embarrassing to write a big rant about how nobody talks to us only to have him paste some chat log where he explained why he was writing Diesel six months ago, and I said "oh, okay" ;-).
  3. Nobody is calling Twisted names or making vague, unsubstantiated accusations.  You're not obligated to examine Twisted, nor Nevow, nor Orbited; I just feel that you owe us some explanation if you publicly say that you tried it and found it wanting.  The tone on the Diesel announcement, in its one brief mention of Twisted, is "we tried it, but we kinda wanted to do our own thing".  So, good for them, they did their own thing, I hope they had fun.
Now, personally, I'd like to leave it at that, but there is a certain inevitable comparison that I think is going to take place.  Diesel has a nicer web page than Twisted.  They have entwittered ... twitified ... uh ... tweetened ... the project, and we haven't; we just have an old-fashioned "blog".  Diesel is smaller than Twisted, so it's easier to explain, and so the people approaching it will have a better idea of its scope.  This might give the immediate impression that it is a simpler, better, more "modern" replacement for Twisted's I/O layer, and this is not the case.  So I still feel it's important that I set the record straight.

Before I launch into my critique, I should say that I don't want to harsh on Diesel too bad. It's a neat little hack and you should go play with it.  And I feel bad pointing out problems with it, since as I mentioned above, nobody's dumping on Twisted.  So, Diesel fans, please take this in the spirit of a frank code-review, not a complaint about your behavior.

The interesting generator-munging bits could be easily adapted to run on top of Twisted's loop, which, arguably, they should have been in the first place; and the toy "hub" that they've written might be good enough for some simple applications where reliability under load is not a serious concern.  In fact, inlineCallbacks might provide a good deal of what is needed to support Diesel's programming style.  Alternately, Diesel might provide some hints as to how things like inlineCallbacks could be made more efficient.
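
For readers who haven't seen the generator style, here is a minimal sketch of the general idea in plain Python.  To be clear, this is neither Diesel's API nor Twisted's inlineCallbacks; `echo_protocol` and `drive` are hypothetical names, and `drive` stands in for whatever event loop would really be feeding data to the generator.

```python
def echo_protocol():
    """A protocol written as a generator: each yield means "wait for a line"."""
    reply = None
    while True:
        line = yield reply  # suspend here until the driver has input for us
        if line == "quit":
            return "goodbye"
        reply = "you said: " + line


def drive(protocol_factory, incoming_lines):
    """Stand-in for an event loop: resume the generator once per line received."""
    gen = protocol_factory()
    next(gen)  # run the generator up to its first yield
    sent = []
    for line in incoming_lines:
        try:
            sent.append(gen.send(line))
        except StopIteration as stop:
            # The generator returned, so the protocol is finished.
            sent.append(stop.value)
            break
    return sent


transcript = drive(echo_protocol, ["hello", "world", "quit", "ignored"])
print(transcript)
```

Nothing in the protocol itself knows where its input comes from, which is why this style could, in principle, sit on top of any loop that is willing to call `send()` when data arrives.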

That said, Diesel's I/O loop sucks.

It's disappointing to see the same mistakes getting made over and over again.  First and foremost: no tests.  Come on, Python community!  You can do better!  Write your damn tests first!

The #1 benefit that a brand-new I/O loop project could have over Twisted is that Twisted was written in the bad old days before everybody knew that TDD was the right way to write programs, so we don't have 100% test coverage.  But, we strive to get closer every day, while every new project decides that they don't need no stinking quality control.

Predictably, as it has no tests, Diesel's I/O layer is full of dead code, inaccurate documentation, and unhandled errors.  Consider this gem, which I found about 30 seconds into reading the code: KqueueEventHub is documented to be "an epoll-based event hub", and its initializer defines an inner function which is never used.  I'm not going to belabor the point by enumerating all the typo bugs I found, but you may find the output of 'pyflakes diesel' interesting.

Instead of Tornado's inaccurate handling of EINTR, Diesel has no handling of EINTR, as far as I can tell.  It also doesn't handle EPERM, ENOBUFS, EMFILE, or even EAGAIN on accept().  To be fair, it has a catch-all exception handler all the way at the top of the stack, so none of these will cause instant crashes, but they will cause surprising behavior in odd situations (and possibly infinite traceback-spewing loops).

More surprisingly - I had to re-read the code about five times to make sure - it doesn't appear that sockets are ever set to be non-blocking, and EAGAIN is not handled from accept(), recv(), or send().  And yes, this can happen even if your multiplexor says your socket is ready for reading and/or writing.  The conditions are somewhat obscure, but nevertheless they do happen.  So, occasionally, Diesel will hiccup and block until some slow network client manages to send or receive some traffic.  In other words: Diesel is not really async.  It just fakes it convincingly, most of the time.
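
As a rough illustration of the kind of handling I mean, here is a sketch of a non-blocking accept loop using only the standard library.  This is not Twisted's implementation or Diesel's; `accept_ready` and `RESOURCE_ERRORS` are names I made up, and the policy of simply giving up for now on resource exhaustion is an assumption that a real server would want to refine.

```python
import errno
import select
import socket

# Errors that mean "can't accept right now", not "the listener is broken".
RESOURCE_ERRORS = {errno.EMFILE, errno.ENOBUFS, errno.EPERM}


def accept_ready(listener):
    """Accept every pending connection on a non-blocking listener."""
    accepted = []
    while True:
        try:
            conn, _addr = listener.accept()
        except BlockingIOError:
            # EAGAIN/EWOULDBLOCK: nothing (more) to accept, even if the
            # multiplexor claimed the socket was readable.
            break
        except InterruptedError:
            # EINTR: interrupted by a signal, so try again.  (Recent Python
            # versions usually retry this for you.)
            continue
        except OSError as e:
            if e.errno in RESOURCE_ERRORS:
                break  # back off; the loop can try again later
            raise
        conn.setblocking(False)  # the new socket must be non-blocking too
        accepted.append(conn)
    return accepted


listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(5)
listener.setblocking(False)

nothing_pending = accept_ready(listener)  # returns immediately, never blocks

client = socket.create_connection(listener.getsockname())
select.select([listener], [], [], 5.0)  # wait until the listener is readable
one_pending = accept_ready(listener)

print(len(nothing_pending), len(one_pending))

for sock in one_pending + [client, listener]:
    sock.close()
```

The important part is the first `except` clause: because the listener is non-blocking, a spurious readiness notification costs one cheap failed call instead of stalling the whole process.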

Once again, there's no way to asynchronously spawn a process, and no way to asynchronously connect a TCP client.  Sure, this looks like an asynchronous connect call, but it's misleading: it blocks on resolving the hostname, and it potentially blocks on the initial SYN/SYN+ACK/ACK exchange.  There's no asynchronous SSL support.  And no, that is not trivial.  Not to mention handling all the crazy errors that spew out of the Windows TCP stack.  And since the loop is implemented to be incompatible with Twisted, it's not obviously trivial to compatibly plug it in and get those features.
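
For comparison, here is a minimal sketch of what a connect that doesn't block on the handshake looks like with plain sockets.  It is an illustration under stated assumptions, not anybody's production code: `start_connect` and `finish_connect` are hypothetical helper names, and the address is a numeric loopback address specifically so that no hostname resolution (which would block) is involved.

```python
import errno
import os
import select
import socket


def start_connect(address):
    """Begin a TCP connection without waiting for the handshake."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    err = sock.connect_ex(address)
    # "In progress" is the normal answer for a non-blocking connect.
    if err not in (0, errno.EINPROGRESS, errno.EWOULDBLOCK):
        sock.close()
        raise OSError(err, os.strerror(err))
    return sock


def finish_connect(sock):
    """Call once the socket is writable: did the connection succeed?"""
    err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
    if err:
        raise OSError(err, os.strerror(err))


# A throwaway local listener, so the example has something to connect to.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)

client = start_connect(listener.getsockname())

# A real event loop would go and do other work here; this sketch just waits.
_, writable, _ = select.select([], [client], [], 5.0)
finish_connect(client)
connected = client in writable
print(connected)

client.close()
listener.close()
```

Even this leaves out everything that makes the real thing hard: resolving names without blocking, timeouts, SSL, and the platform-specific error codes.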

Again, I don't want to dump on Diesel here; for what it is, i.e. an experiment in how to idiomatically structure asynchronous applications, it's all right.  For that matter Twisted has its fair share of bugs too, which would be pretty easy to lay out in a similar post; you wouldn't even need to do the research yourself, just go look at our bug tracker.

But both Diesel and Tornado make the mistake of attempting to replace the years of trial-and-error, years of testing discipline, and years of portability and feature work that Twisted has accumulated with a few oversimplified, untested hacks.

What they could have done is contributed any extensions that they needed to Twisted's loop, or modifications to Twisted's packaging that would allow them to get a smaller sliver of Twisted's core to bootstrap, if that's what they needed.

My goal in pointing out all these flaws is not to illustrate any particular point about Diesel, but to reinforce the point I implicitly made in my Tornado post, which is that if you try to write a new mainloop (especially without tests) you will screw it up.  You will most likely screw it up in ways which will only surface later, under mysterious circumstances, when your servers are under load and you are under the gun for a deadline.

Or if I happen to get wind of it and write a blog post about it, of course.  Then you get to cheat a little.

It's not an indictment of Diesel that it screwed this up; everyone screws it up.  I would probably screw it up, if I didn't have Twisted sitting in front of me as a direct reference.  POSIX by itself is unreasonably subtle and difficult, but POSIX, plus the subtle variations in different platforms which implement it, plus the Windows APIs which are almost-but-not-quite-exactly-nothing-like the POSIX APIs, presents an inhuman challenge.

Hopefully Diesel will grow some tests.  Hopefully it will fix, or better yet shed, its somewhat unfortunate I/O hub.  I am hopeful that someone will follow Dustin's excellent lead (perhaps Dustin himself!) and port Diesel's API and generator system over to Twisted's I/O architecture and eliminate all these silly bugs.  Of course, if someone did that, you could use Dustin's tornado port with Diesel.

With the silly bugs from the I/O loop out of the way, the Diesel team can write tests for the more interesting pieces, and fix the bugs which aren't entirely silly :-).