Deciphering Glyph
A collection of articles, ideas, and rambling from a guy who wrote some software that one time.
Friday, January 20, 2012
The Concurrency Spectrum: from Callbacks to Coroutines to Craziness
Concurrent programming idioms are on a spectrum of complexity.
Obviously, writing code that isn't concurrent in any way is the easiest. If you never introduce any concurrent tasks, you never have to debug any problems with things running in an unexpected order. But, in today's connected world, concurrency of some sort is usually a requirement. Each additional point where concurrency can happen introduces a bit of cognitive overhead, another place you need to think about what might happen, so as a codebase adds more of them it becomes more difficult to understand them all, and it becomes more challenging to understand subtle nuances of parallel execution.
So, at the simplest end of the spectrum, you have callback-based concurrency. Every time you have to proceed to the next step of a concurrent operation, you have to create a new function and new scope, and pass it to the operation so that the appropriate function will be called when the operation completes. This is very explicit and reasonably straightforward to debug and test, but it can be tedious and overly verbose, especially in Python where you have to think up a new function name and argument list for every step. The extra lines for the function definition and return statement can be an impediment to quickly understanding the code's intentions, so what facilitates understanding of the concurrency model can inhibit understanding of the code's actual logical purpose, depending on how much concurrent stuff it has to do. Twisted's Deferreds make this a bit easier than raw callback-passing without fundamentally changing the execution dynamic, so they're at this same level.
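To make the "new function and new scope for every step" point concrete, here is a minimal sketch using only the standard library. The names fake_fetch, pending, and run_pending are hypothetical stand-ins for a real I/O operation and a real event loop; this is not Twisted's API, just the shape of the style.

```python
from collections import deque

# Hypothetical stand-in for an event loop's queue of completed operations.
pending = deque()

def fake_fetch(key, callback):
    """Pretend to start an I/O operation; the callback fires later."""
    pending.append((callback, "value-for-" + key))

def run_pending():
    """Deliver each completed result to its callback, one at a time."""
    while pending:
        callback, result = pending.popleft()
        callback(result)

log = []

def got_user(user):
    # Step 2 needs its own function, name, and argument list.
    log.append(user)
    fake_fetch(user + "/profile", got_profile)

def got_profile(profile):
    # Step 3: the end of the chain.
    log.append(profile)

fake_fetch("alice", got_user)   # Step 1: kick off the chain.
run_pending()
print(log)
```

Two steps of actual logic cost two extra 'def' lines and two extra names, which is the verbosity complained about above.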
Then you have explicit concurrency, where every possible switch-point has to be labeled somehow. This is yield-based coroutines, or inlineCallbacks, in Twisted. This is more compact than using callbacks, but also more limiting. For example, you can only resume a generator once, whereas you can run a callback multiple times. However, for a logical flow of sequential concurrent steps, it reads very naturally, and is shorter, as it collapses out the 'def' and 'return' lines, and you have to think of at least two fewer names per step.
However, that very ease can be misleading. You might gloss over a 'result = yield ...' more easily than a 'def whatever(result): return result; something(whatever)'. Nevertheless, if you have 'yield's everywhere you might swap your stack, then when you have a concurrency bug, you can look at any given arbitrary chunk of code and know that you don't need any locks in it, as long as you can't see any yield statements. Where you do see yield statements, you know that you have some code that needs to be inspected.
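As a rough illustration of why the yield style is easier to audit, here is a sketch of a generator-based coroutine with a tiny hand-rolled driver. The drive and fake_fetch functions are assumptions made up for this example; inlineCallbacks does considerably more than this, but the visible switch points work the same way.

```python
def fake_fetch(key):
    """Hypothetical operation: returns a request the driver will fulfil."""
    return ("fetch", key)

def drive(generator):
    """Run a generator to completion, sending each request's result back in."""
    switch_points = 0
    try:
        request = next(generator)
        while True:
            switch_points += 1
            _, key = request
            request = generator.send("value-for-" + key)
    except StopIteration as stop:
        return stop.value, switch_points

def load_profile(name):
    user = yield fake_fetch(name)                    # switch point 1
    profile = yield fake_fetch(user + "/profile")    # switch point 2
    return [user, profile]

result, switches = drive(load_profile("alice"))
print(result, switches)
```

Every place where other code could run in between is marked with a 'yield'; the lines without one cannot be interrupted by this driver.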
To continue down that spectrum, a cooperatively multithreaded program with implicit context switches makes every line with any function call on it (or any line which might be a function call, like any operator which can be overridden by a special method) a possible, but not likely, culprit. Now when you have a concurrency bug you have to audit absolutely every line of code you've got, although you still have a few clues which will help you narrow it down and rule out certain areas of the code. For example, you can guess that it would be pathological for 'x = []; ...; x.append(y)' to context switch. (Although, given arbitrary introspection craziness, it is still possible, depending on what "..." is.) This is way more lines than you have to consider with yield, although with some discipline it can be kept manageable. However, experience has taught me that "with some discipline" is a code phrase for "almost never, on real-life programming projects".
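Here is a small sketch of the "any operator which can be overridden by a special method" problem. The hypothetical_switch function is an assumption standing in for whatever a cooperative threading library would do when it decides to switch tasks; nothing here uses a real threading library.

```python
events = []

def hypothetical_switch():
    """Stand-in for an implicit context switch in a cooperative threading library."""
    events.append("other task ran here")

class Sneaky:
    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        # Arbitrary code runs here, invisible at the call site.
        hypothetical_switch()
        return Sneaky(self.value + other.value)

events.append("before")
total = Sneaky(1) + Sneaky(2)   # No call syntax and no yield on this line.
events.append("after")
print(total.value, events)
```

Nothing on the addition line hints that another task might have run in the middle of it, which is why the audit has to cover every line.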
All the way at the end of the spectrum of course you have preemptive multithreading, where every line of code is a mind-destroying death-trap hiding every possible concurrency peril you could imagine, and anything could happen at any time. When you encounter a concurrency bug you have to give up and just try to drink your sorrows away. Or just change random stuff in your 'settings.py' until it starts working, or something. I never really did get comfortable in that style. With some discipline, you can manage this problem by never manipulating shared state, and only transferring data via safe queueing mechanisms, but... there's that phrase again.
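For what the disciplined version looks like, here is a minimal sketch with the standard library's threading and queue modules. The worker function and the None sentinel are conventions chosen for this example, not anything the approach requires.

```python
import queue
import threading

def worker(inbox, outbox):
    """Square each number received; touches no shared mutable state."""
    while True:
        item = inbox.get()
        if item is None:      # Sentinel: no more work.
            break
        outbox.put(item * item)

inbox, outbox = queue.Queue(), queue.Queue()
thread = threading.Thread(target=worker, args=(inbox, outbox))
thread.start()

for n in range(5):
    inbox.put(n)
inbox.put(None)
thread.join()

squares = [outbox.get() for _ in range(5)]
print(squares)
```

The only objects both threads touch are the two queues. The discipline lies in keeping it that way as the program grows.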
Some programming languages, like Erlang, support efficient preemptive processes with state isolation and built-in super-cheap super-fast queues to transfer immutable values. (Some other languages call these "threads" anyway, even though I would agree with Erlang's classification as "processes".) That's a different programming model entirely though, with its own advantages and challenges, which doesn't land neatly on this spectrum; if I'm talking about left and right here, Erlang and friends are somewhere above or below. I'm just describing Python and its ilk, where threads give you a big pile of shared, mutable state, and you are constantly tempted to splash said state all over your program.
Personally I like Twisted's style best; the thing that you yield is itself an object whose state can be inspected, and you can write callback-based or yield-based code as each specific context merits. My opinion on this has shifted over time, but currently I find that it's best to have a core which is written in the super-explicit callback-based approach with no coroutines at all, and then high-level application logic which wraps that core using yield-based coroutines (@inlineCallbacks, for Twisted fans).
I hope that in a future post, I may explain why, but that would take more words than I've got in me tonight.
Thursday, December 15, 2011
I'm Sorry It's Come To This
I usually try not to get too political in my public persona – on blogs, twitter, IRC, mailing lists et cetera – and that's a conscious choice.
I work on open source software. I have for the last ten years. I am lucky enough to have founded a project of my own, but in open source, leaders are more beholden to their followers than vice versa. I depend on people showing up to effectively work for me, for free, on a regular basis. So, I try to avoid politics not because I don't have strong convictions (anyone who knows me personally can tell you that I certainly do) but because I don't want someone to avoid showing up and helping do some good in the world in one area, just because we might disagree in another.
This is a benefit of living in a free and democratic society: we have ways to dispute issues that we have strong feelings about, so we can cooperate on some things without having to agree on everything. It's rarely perfect but we can usually get some good stuff done, with rough consensus and running code.
Today though, there's a political issue which I can't ignore. The purpose of Twisted (the open source project which I founded) is to facilitate the transfer of information across the Internet. A new law, SOPA, is threatening to radically alter the legal infrastructure of the Internet in the United States, granting sweeping new powers to copyright cartels and fundamentally restricting the legal right to transfer any information, and to build tools that transfer it. Twisted is designed to make it easy to implement new protocols, to easily experiment with improvements to systems like the Domain Name System. SOPA might well make those potential improvements, and with only a little paranoid fantasizing, Twisted itself, illegal.
It's my view that this law is a blatantly unconstitutional restriction on free speech. It will kill job creation, at a time when our nation can scarce afford another blow to its economy. It will create the infrastructure to suppress political dissent, similar to the infrastructure in China and Syria, at a time when our corrupt political system needs dissent more than ever. It is the wrong thing at the wrong time.
This bill is being discussed in the House today. If you're in the US, call your representative right now.
(As always, I don't speak for anyone but myself; no one else has reviewed or endorsed these remarks.)
Friday, November 04, 2011
Blocking vs. Running
A thread is blocking when it is performing an input or output operation that may take an unknown amount of time. Crucially, a blocking thread is doing no useful work. It is stuck, consuming resources - in particular, its thread stack, and its process table entry. It is sucking up resources and getting nothing done. These are resources that one can most definitely run out of, and are in fact artificially limited on most operating systems, because if one has too many of them, the system bogs down and becomes unusable.
A thread may also be "stuck" doing some computationally intensive work; performing a complex computation, and sucking up CPU cycles. There is a very important distinction here, though. If that thread is burning up CPU, it is getting work done. It is computing. This is why we have computers: to compute things.
It is of course possible for a program to have a bug where it goes into an infinite loop, or otherwise performs work on the CPU without actually getting anything useful to the user done, but if that's happening then the program is just buggy, or inefficient. Such a program is not blocking, though: it might be "thrashing" or "stuck" or "broken", but "blocking" means something more specific: that the program is sitting around, doing nothing, while it is waiting for some other thing to get work done, and not doing any of its own.
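One way to see the difference is to compare wall-clock time with CPU time. In this sketch, time.sleep is an assumption standing in for a slow I/O operation, and the measure helper is made up for the example.

```python
import time

def measure(work):
    """Return (wall_seconds, cpu_seconds) spent running work()."""
    wall_start, cpu_start = time.perf_counter(), time.process_time()
    work()
    return time.perf_counter() - wall_start, time.process_time() - cpu_start

def blocking():
    time.sleep(0.2)             # Waiting on something else; no work done.

def running():
    total = 0
    for i in range(2_000_000):  # Busy computing.
        total += i * i

blocked_wall, blocked_cpu = measure(blocking)
busy_wall, busy_cpu = measure(running)
print(f"blocking: wall={blocked_wall:.3f}s cpu={blocked_cpu:.3f}s")
print(f"running:  wall={busy_wall:.3f}s cpu={busy_cpu:.3f}s")
```

The blocking call uses up wall-clock time while consuming almost no CPU; the busy loop's CPU time tracks its wall-clock time much more closely.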
A program written in an event-driven style may be busy as long as it needs to be, but that does not mean it is blocking. Hence, event-driven and non-blocking are synonyms.
Furthermore, non-blocking doesn't necessarily mean single-process. Twisted is non-blocking, for example, but it has a sophisticated facility for starting, controlling and stopping other processes. Information about changes to those processes is represented as plain old events, making it reasonably easy to fold the results of computation in another process back into the main one.
If you need to perform a lengthy computation in an event-driven program, that does not mean you need to stop the world in order to do it. It doesn't mean that you need to give up on the relatively simple execution model of an event loop for a mess of threads, either. Just ask another process to do the work, and handle the result of that work as just another event.
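A sketch of "ask another process to do the work", using the standard library's asyncio as a stand-in event loop rather than Twisted's own process support. The compute_in_child and heartbeat names are hypothetical, and the child here is just another Python interpreter doing a sum.

```python
import asyncio
import sys

async def compute_in_child(n):
    """Ask a separate Python process to sum range(n); await its answer."""
    child = await asyncio.create_subprocess_exec(
        sys.executable, "-c", f"print(sum(range({n})))",
        stdout=asyncio.subprocess.PIPE,
    )
    output, _ = await child.communicate()
    return int(output)

async def main():
    ticks = []

    async def heartbeat():
        # The event loop keeps servicing other events meanwhile.
        for i in range(3):
            ticks.append(i)
            await asyncio.sleep(0)

    total, _ = await asyncio.gather(compute_in_child(1000), heartbeat())
    return total, ticks

total, ticks = asyncio.run(main())
print(total, ticks)
```

The child's result arrives as one more event for the loop to deliver, and the heartbeat task keeps running while the child computes.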
Saturday, September 10, 2011
2L2T: DjangoCon Feedback
For the most part, the talk was exceedingly well-received and I want to thank the Django community both for the opportunity to speak and for the overwhelmingly positive response. Thanks for making an outsider to your community feel welcome and appreciated.
There have been a couple misconceptions though, and perhaps I didn't express myself clearly on a few points.
- I realize that there are times – plenty of times, even – when using some component that's in a different language from your main application is the right choice. I wasn't trying to say "all Python all the time no matter what, no exceptions". I just want you all to consider that there is a cost to using a component that's in a different language, and you should be aware of that cost. It's not as simple as a tick-a-box feature comparison of the features and drawbacks of multiple products. If I came across as sounding really extreme on this, it was just to provoke a response.
- You can have an architecture which is driven by Python and organized by Python without actually having all the implementation be in Python. For example, an inordinate number of people asked me about memcache. If you want to use something like that, sure, use memcache, there's not a lot that it being in Python would buy you. Some might say that the whole point of memcache is that it isn't very deeply configurable and doesn't have much in the way of behavior. Plus, it's an internal component, not an externally visible service, so even my usual flimsy "no buffer overflows" argument doesn't really hold up; it's more like a library than a server. You can incorporate memcache into a Python-in-the-driver's-seat architecture by spawning memcache from your Python process instead of making memcache a configuration dependency. That way, you don't need a separate configuration file and a separately managed service or a chef script that boots memcache for you before your application. This applies equally well to any other, similar services: write their config files from your Python code, and start them automatically.
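To sketch the second point: the application writes the service's configuration and starts the service itself. I can't assume memcache is installed wherever this runs, so SERVICE_SOURCE is a hypothetical stand-in (a child Python process that reads its config file and reports it), and start_service and the JSON config format are likewise assumptions for illustration.

```python
import json
import os
import subprocess
import sys
import tempfile

# Hypothetical stand-in for a real service binary such as memcached:
# a child Python process that reads its config file and reports it.
SERVICE_SOURCE = (
    "import json, sys; "
    "cfg = json.load(open(sys.argv[1])); "
    "print('listening on port', cfg['port'])"
)

def start_service(settings, directory):
    """Write the service's config from application settings, then spawn it."""
    config_path = os.path.join(directory, "service.json")
    with open(config_path, "w") as config_file:
        json.dump(settings, config_file)
    return subprocess.Popen(
        [sys.executable, "-c", SERVICE_SOURCE, config_path],
        stdout=subprocess.PIPE, text=True,
    )

with tempfile.TemporaryDirectory() as directory:
    service = start_service({"port": 11211}, directory)
    banner, _ = service.communicate()   # Wait for the stand-in to exit.

print(banner.strip())
```

The settings live in one place, the Python application, and the service's config file is a generated artifact rather than something an operator has to keep in sync by hand.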
Update: The video of my talk is now available on blip.tv.
Sunday, June 19, 2011
ἁγιολογία for r0ml
My dad had a cool hacker alias in the seventies. He's been known as "r0ml" around the web since before there was a web. If you are in a particularly typographically hip part of the internet, it might even be "RØML". How many of your parents have a nom de plume with a digit, or a non-ASCII character in it? Or, for that matter, any kind of hacker pseudonym?
I had the good fortune to work with one of r0ml's colleagues, Amir Bakhtiar. Amir paid me one of the highest compliments I've ever received: he said that the code for systems I've worked on is similar to r0ml's in its style and exposition. My dad taught me how to program in x86 assembler, and in that process, I learned a lot about the way he thought about solving problems and building systems. I regard thinking that well, or even comparably well, as a real achievement.
That's not to say that I would do everything exactly the way that he does. For example, he writes a lot of networking code in Java. He doesn't use Twisted, for the most part. If you know me and you know my dad, you know that we disagree on plenty of stuff.
Unlike the stereotypical, often-satirized filial argument, these discussions are something I look forward to. Disagreeing with my dad is still one of the most intellectually challenging activities I've ever engaged in. Whenever I have a conversation with him about a topic where he has a different view, I come away enlightened – if not necessarily convinced.
Conversations among my friends occasionally turn to the topic of our respective upbringings, as they do in any close group. One of the recurring themes of my childhood is that, while my siblings and I were sometimes told to be quiet, we were never told to be quiet because our opinions weren't valuable. Sometimes we were told in unequivocal terms that we were wrong, of course. However, my dad always encouraged us to present our thoughts. Then, he wouldn't pull any punches in relentlessly refuting our arguments, using a combination of facts, estimates, calculations, and rhetorical flourishes. I learned more about influencing people and thinking clearly around the dinner table than in my entire formal education.
r0ml always questions glib answers, challenges the official version of events, distrusts things that are "intuitively obvious" or "common sense". The skepticism I've developed as a result of his consistent example has rarely led me astray. Glib answers, official versions, and common sense are frequently, if not always, wrong. He taught me to search for the non-intuitive answer, the surprising inflection point in the data.
In a roundabout way, he also taught my siblings and me how to perform some delightful rhetorical flourishes of our own, but also not to trust them. Pretty phrases can be deployed equally effectively in the service of illustration or deception. Although I can appreciate that parents often come to a point where they've had enough and a little deception can be a useful thing.
One cannot be a practiced rhetorician without a heaping helping of eclectic life experience; r0ml has that too. He's a fencer. And a juggler. He still has the highest score on Space Harrier of anyone I've ever met. (I can remember a crowd gathering in an arcade to see him start level 18.) He's an avid scholar of medieval thought and custom. For that matter, he's an avid scholar of a couple dozen other things, but listing them all would take a whole day.
He has the common occupational affliction of being a science fiction fan. However, fandom was never an identity for him. Again, by consistent example, he taught me to focus on my own creativity, and do something cool, never to just passively consume others' ideas. He treats entertainment as an inspiration, rather than an escape. For instance, one of the earliest memories I have about my father talking about software is a reference to the movie "Terminator". (Please keep in mind that this memory is ~20 years old at this point, so it might not be terribly accurate.) I remember him saying something like "All software should be relentless. If you remove its legs, it should use its arms. Whatever errors it encounters, it should deal with them, and keep going if it can."
Nevertheless, seeing "Tron: Legacy" with my dad, the hacker, in IMAX 3D, 20 years after we saw the original together... I didn't need to take a life lesson from that to think it was pretty rad[1].
Unlike many quiet geniuses who labor in obscurity, dispensing wisdom only to a fortunate few, r0ml is a somewhat notorious public speaker. You can see him this year at OSCON. If you hunt around the web, you can find some video examples of his previous talks, like this great 30-second interview[2] about the nature of open source process, from a talk he gave in 2008 (audio of the full talk here).
([1]: Although, jeez, what was the point of that whole open-source subplot at the beginning? It seemed like a great idea, but then it went absolutely nowhere!)
([2]: Speaking of not doing things exactly the way he does - where he uses a metaphor to "single-threading" and "multi-threading", I would have said "blocking" and "event-driven" - but more on that in a future post.)
Happy Father's Day, r0ml.
Saturday, April 02, 2011
Calling all Ascetic Buddhist Rock Musicians
The Presentation
The inimitable Zooko recently made me aware of an excellent presentation about HTTPS: "It's Time to Fix HTTPS", by Chris Palmer.
The presentation is both hilarious and illuminating; I highly recommend you view it right away. It's not saying anything that I haven't been thinking for a very long time. Except the thing about how IE can silently add certificates to your root CA store, that was definitely new, and a little depressing. But this is a somewhat esoteric topic and it needs to be made more popular for the everyday user. Sexy, even.
A Brief Review
(But seriously, go read the slides, they're more entertaining.)
Internet security is based on trust. The math behind modern cryptography doesn't ensure anything beyond that you're talking to someone that holds a particular special secret ("private key"). You can verify that the party you're talking to has the same key as the one you talked to last time, and that a particular private key corresponds to a particular public key, but that's about it. The public key can be published for everyone to see without risking any of the secrets being sent, but you still need some way to determine whether the public key actually belongs to the person you want to talk to. So, in order to have a secure system, you have to layer some rules on top of that which give you some way to know whether that private key corresponds to an identity that you care about and trust.
The current system goes something like this: each web browser vendor decides, more or less at random, on a group of entities we will all trust completely. By virtue of the trust of the software, they become the authorities who can decide whose public keys are valid. Actually, a public key isn't quite enough: you need a key plus some metadata about the person sending it: we call this a "certificate". So these entities are termed "certificate authorities". The browser vendors tend to decide on the same group, because there's a lot of social pressure to maintain a list that makes sense (and also, anybody who gets accepted by one browser but denied by another can't really sell certificates: the whole point of this exercise is to sell things that make the little lock icon come up, so you know your web shopping cart is "secure").
The problem with this system is that almost all of these "completely trustworthy" entities are enormous companies or possibly even foreign governments, which have diverse motivations and huge amounts of legitimate business to conduct, making it very hard to spot a small amount of malfeasance. (Although there is some good news: people do notice, and they freak the hell out when they do; so at least there's some policing of the current system.) One compromised certificate authority (and there are lots and lots to try and compromise) means a complete "game over" for everybody who uses a web browser and trusts the little lock icon.
Basically there's no such thing as "completely trustworthy". There's only: do I trust you.
The Next Step
The solution that Mr. Palmer proposes is extremely similar to the one which I thought I originally devised in about 2004, but probably was floating around in the security zeitgeist even before that. It's a combination of 3 general principles:
Trust On First Use
Basically, the first time I see you, on the internet, it's unlikely that you're trying to trick me. So you can give me any old public key, and I'll accept that it's you.
Mr. Palmer gives this one a catchy pseudonym, "TOFU", which I quite like (and I guess is pretty widely known at this point).
Persistence Of Pseudonym
The important point is that then I remember that it's you, forever, so it's very hard to attack our communications after that point.
I'll come up with a name for you (let's say "Bob Smith" or "The Most Secure Bank In The World Dot Com"), and my software will make sure that it sticks to that public key. You can potentially tell me that your key has changed, but you'd better be prepared to present your old key, otherwise I have to get re-introduced to you, and now I'm suspicious that something may have been fishy. Especially if some other thing shows up and says "Hi, it's Bob Smith" (with the correct, old public key) - "Hey, who's this guy?"
This is referred to as "POP". Also pretty catchy.
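The first two principles fit in a few lines. This is a sketch under stated assumptions: PinStore and its return values are made up for illustration, the store lives in memory (a real one would have to persist across sessions), and it doesn't model a legitimate key change vouched for by the old key.

```python
import hashlib

class PinStore:
    """Remembers the first public key seen for each name (hypothetical sketch)."""

    def __init__(self):
        self._pins = {}

    def check(self, name, public_key):
        fingerprint = hashlib.sha256(public_key).hexdigest()
        pinned = self._pins.get(name)
        if pinned is None:
            self._pins[name] = fingerprint      # Trust on first use.
            return "first-use"
        if pinned == fingerprint:
            return "match"                      # Same pseudonym, same key.
        return "MISMATCH"                       # Needs a re-introduction.

store = PinStore()
results = [
    store.check("bank.example", b"key-A"),
    store.check("bank.example", b"key-A"),
    store.check("bank.example", b"key-B"),
]
print(results)
```

The third check is the "Hey, who's this guy?" moment: the name is familiar but the key is not.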
Mesh Overlay Network Keysigning
The third concept Mr. Palmer refers to as a "trustiness metric" which includes "perspectives", and says "You can't fool all of the people all of the time". He includes some other stuff in his trustiness metric here, but I'm going to extrapolate from that sentence:
It's really, really easy to sit down in a café and intercept some of my network traffic. It takes about 2 minutes to collect a dozen passwords this way, on today's mostly-not-encrypted internet. So it would be very easy for someone to break this system if all you had was a little re-introduction warning; users might not understand it and just click anyway, and then it's just as broken as (if not worse than) the current model; at least in the current model, normal users don't usually get those warnings, and they're "safe" if they're looking for the lock, but in this new model, users would get them for all new secure introductions. So we need something better.
It's not so easy to sit down in a café and intercept network traffic from me and also intercept traffic from my friend, on a different network, doing a different thing. You have to know where my friend is. You have to be able to intercept our pre-arranged secure communication (I already remember all my friends' keys when I first see them, you'll recall). If you're a casual attacker who just wants to sniff a couple of credit card numbers at the local Starbucks, you probably don't have the resources to do that, even for a single individual.
It is definitely not easy to figure out where every single one of my currently-online friends - let's say Facebook friends, because maybe they finally care about security now - is online from, and also attack their networks simultaneously, to provide exactly the same bogus first-introduction certificate to Super Secure Bank Dot Com. This is a level of sophistication and coordination that not even most governments can muster.
So if we had a reasonably available mesh overlay network, where I can tell my friends, and my friends can tell their friends (etc forever) about first-introduction key correspondence with DNS names, and legitimate changes to keys where the site operator has had a security problem, then we could address many of these issues much more robustly than we can today. It might not be perfect, but it would silently work often enough that it would be much better than today's default of "bah, I don't know why you're getting the browser warning; just use HTTP".
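The decision such a network would feed can be sketched very simply: compare the key I was handed against what my friends report having seen. The consensus_key function, the quorum threshold, and the fingerprint strings are all assumptions for illustration; how friends actually exchange observations is the hard part and isn't shown.

```python
from collections import Counter

def consensus_key(observations, quorum):
    """Return the fingerprint reported by at least `quorum` peers, else None."""
    if not observations:
        return None
    fingerprint, votes = Counter(observations.values()).most_common(1)[0]
    return fingerprint if votes >= quorum else None

# What each (hypothetical) friend saw for the same DNS name.
friends_saw = {"alice": "fp-1", "bob": "fp-1", "carol": "fp-1", "dave": "fp-1"}
my_view = "fp-EVIL"   # What I was handed on the café network.

agreed = consensus_key(friends_saw, quorum=3)
suspicious = agreed is not None and agreed != my_view
print(agreed, suspicious)
```

An attacker who controls only my network can change my_view, but not what four friends on four other networks saw.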
Badump Ching
If you've been paying attention I think you can see where I am going with this.
We (those of us in the open source hipster security noosphere) need to popularize this concept, because it's not that hard to implement, people keep re-inventing it everywhere, it's mostly just about getting some browser vendor to think it's a good idea.
The acronym is TOFU POP MONK, so clearly we need a vegetarian monk - Buddhist seems most likely - who sings pop songs about how great tofu is. We need it to go viral on the you tubes, and any other tubes that are appropriate.
(Graphic design nerds, and sports racers of all stripes, start your engines. I challenge you. Show me some awesome macroable meme images starring the Tofu-Pop Monk. I will post any particularly compelling ones here.)
Saturday, December 04, 2010
Resolving diverged Bazaar branches on the go with 'dead heads'.
    bzr: ERROR: These branches have diverged. Use the missing command to see how. Use the merge command to reconcile them.
but you don't have time to do a merge, and wait for the (reliably agonizingly slow) network round trip to negotiate with the server about what the latest revision is - the train's about to leave, or you're late for your flight, or the cafe is closing and you need to shut your laptop right now. Sadness! You continue to work on a diverged branch and merge later. Which is a shame, because mechanically dealing with merge conflicts or just making sure the tests still pass after what looks like a trivial merge is exactly the sort of thing which is convenient to do when you're stuck waiting at a network-access-free bus stop.
As it turns out, Bazaar has actually already done all the hard work necessary for you to just go ahead and do that merge when you get to your potentially non-networked destination. The diverged revisions have already been pulled into your branch and are just sitting there, waiting to be merged, but you can't see them. The 'bzrtools' plugin provides the 'heads' command, which you can use to reveal the previously invisible revision. You can then just 'merge .' instead of merging from your usual pull location, as long as you specify the appropriate revision.
To demonstrate, here's a transcript of a sample session which simulates this common problem:
First, set up a branch:
    you@computer:~$ mkdir tmp
    you@computer:~$ cd tmp
    you@computer:~/tmp$ mkdir a
    you@computer:~/tmp$ cd a
    you@computer:~/tmp/a$ bzr init
    Created a standalone tree (format: 2a)
    you@computer:~/tmp/a$ touch initial.txt
    you@computer:~/tmp/a$ bzr add
    adding initial.txt
    you@computer:~/tmp/a$ bzr ci -m "inital revision"
    Committing to: /Domicile/glyph/tmp/a/
    added initial.txt
    Committed revision 1.
We'll call 'a' the 'server' branch. Next, let's make a branch that represents the 'on the go' branch, your local working copy:
    you@computer:~/tmp/a$ cd ..
    you@computer:~/tmp$ bzr get a b
    Branched 1 revision(s).
Now, it's time to diverge. Let's give each branch its own revision.
    you@computer:~/tmp$ cd a
    you@computer:~/tmp/a$ touch a.txt
    you@computer:~/tmp/a$ bzr add
    adding a.txt
    you@computer:~/tmp/a$ bzr ci -m 'revision from a'
    Committing to: /Domicile/glyph/tmp/a/
    added a.txt
    Committed revision 2.
    you@computer:~/tmp/a$ cd ../b/
    you@computer:~/tmp/b$ touch b.txt
    you@computer:~/tmp/b$ bzr add
    adding b.txt
    you@computer:~/tmp/b$ bzr ci -m 'revision from b'
    Committing to: /Domicile/glyph/tmp/b/
    added b.txt
    Committed revision 2.
Now, it's time to get on that sad, wifi-free train. Let's make sure we're up to date with 'a' first...
    you@computer:~/tmp/b$ bzr pull ../a
    bzr: ERROR: These branches have diverged. Use the missing command to see how. Use the merge command to reconcile them.
    [Error: 3]
Oh no! But, here comes 'bzr heads' to the rescue:
    you@computer:~/tmp/b$ bzr heads --dead
    HEAD: revision-id: you@computer-123456 (dead)
    committer: You <you@computer>
    branch nick: a
    timestamp: now-ish
    message: revision from a
Now you know what the revision ID of the already-pulled-but-not-visible revision is - the tip of 'a', in other words. Now you just need to ask 'b' to merge it:
    you@computer:~/tmp/b$ bzr merge . -r you@computer-123456
    +N a.txt
    All changes applied successfully.
    you@computer:~/tmp/b$ bzr ci -m 'merge from a'
    Committing to: /Domicile/glyph/tmp/b/
    added a.txt
    Committed revision 3.
Done! And, when you get back to your cozy 10gigE fiber connection at home, or whatever you happen to have, you can see that the revision you've merged lines up neatly with 'a':
    you@computer:~/tmp/b$ bzr pull ../a
    No revisions to pull.
    you@computer:~/tmp/b$
Et voila. I hope this saves somebody some time when dealing with failed pulls.
For those of you who may be curious about the use-case, if you don't have it: I rarely encounter this with actual codebases I work on, as I tend to have a local trunk mirror, and features are neatly segregated into branches. It comes up more frequently in my personal configuration-files repository, where I make little changes to my desktop, little changes to my laptop, and then want to get out the door quickly with the latest merged copy. I was so happy when #bzr on freenode (thanks, spiv!) solved this problem for me that I just had to share.
Wednesday, January 06, 2010
Some Common Onomatological Errors
- the Twisted project
- the Twisted engine
- the Twisted networking engine
- the Twisted framework
I can understand that there is some confusion around this stuff, since these words often appear in close proximity, but to my knowledge there is nothing called "Python Twisted", "Twisted Python", or "Twisted Matrix". There's "python-twisted", which is the package name that some operating systems use to package Twisted. There is also "twisted.python", which is a Python package within Twisted itself. Finally there is "twisted-python@twistedmatrix.com", which is the mailing list for discussing Twisted stuff in the Python programming language. (This discussion list is so named to distinguish it from the possibility of not-quite-hypothetical discussion of Twisted implemented in other languages, although no other implementations are currently actively maintained.)
I just thought you'd all like to know that. That is all. (For now, anyway.)

Saturday, October 24, 2009
Learn Twisted
Do you want WiFi to work at your conference?
Joel has a lot of what look like good recommendations. However, I can provide a much-abridged list.
Some years, WiFi access at PyCon US has been provided by the venue, or by a contractor whose name I mercifully do not know. Those years, it has not worked. Some years, it has been provided, or at least managed, by tummy.com. Those years, it has worked. They are probably much more critical of their own efforts than I am, as you can see in this thorough write-up that they did of PyCon's 2008 WiFi situation.
My two-step plan for you if you want your conference to have working WiFi access is:
- e-mail somebody at tummy.com, telling them that you want a working wireless network, and
- give them whatever they ask for.
Sunday, October 04, 2009
Hobgoblin History
The problem is, I already have solutions for all of these problems, and I don't quite understand why they don't (or shouldn't) work for me. (Since he organizes the post in terms of problems that existing systems have, I'm going to take the liberty of re-labeling these in terms of the problems that he seems to be describing rather than the lead text he used. Please post a comment if you think my labeling is wrong.)
In existing systems, Terry says:
"Things must be named, and have one name." Specifically, Terry calls out file systems. Except... file systems have lots of ways of introducing multiple names for the same thing. Symbolic links. Hard links, if you really want to allow for ambiguity. If you want to track that ambiguity, Windows "shortcuts" and MacOS "aliases" can do that. Overlay mounts, loopback mounts and chroot execution allow for semi-arbitrary renaming. Lots of other systems support this, too. Database systems have a specific provision for multiple names: the many-to-one relation. Any programming language with pass-by-reference data structures allows for some level of multiple-naming. In fact, there's a whole discipline for allowing things to have lots of different names: indexing. Anywhere you have a full-text index or an object where multiple attributes are indexed in some kind of database, you've got objects with more than one name.
"You have to be consistent and unambiguous." As I mentioned on the first point, there are lots of ways to be slightly ambiguous at a human level. You can refer to the same thing by different names, or, with mutable binding, you can refer to the different things with the same name. In some circumstances, you must be precise, but that's because fundamentally, algorithmic thinking requries a certain level of precision, not because of any specific problem with computers. In fact, there is a word for inconsistency and ambiguity in programming languages: polymorphism. Any time you invoke an interface rather than a concrete implementation (which is to say any time you do anything in a dynamic language like Python) you are being ambiguous and potentially inconsistent in your program's behavior.
"You only get one way to organize stuff." This is a pretty weak point, though, given that Terry himself immediately turns around and notices that tagging and other multiply-indexed database systems are becoming popular. So he gives us two examples of exceptions, but no examples of the rule. I'm not sure what I could add to that.
"Programmers are obsessed with "meaning"." On this one, I'm going to agree, except I don't think it's a problem. In the computational world, we are obsessed with the meaning of data, because if you get the meaning of the inputs wrong, then the meaning of the outputs is wrong too. For example: if you have a number that represents the total liabilities that your company has accumulated, it's pretty important that you don't ever treat that as your total profit. At a deeper level, if you have a sequence of bits that represents a floating-point number, it's important to know about its intended meaning, and not treat it as a string of characters, unless what you really want is a string. "@H=N" is not as useful a concept as "3.1287417411804199" if you are trying to add it to something. For what it's worth, I have my own, similar take on how we should treat computational objects that have multiple meanings: Imaginary. Even systems like Imaginary and FluidDB depend on a very rigid definition of some simpler concepts, like numbers consistently being numbers and words consistently being words. In my view, even if we treat the book itself as multifaceted, it's important to know what the data representing the "readable object" part of a book is really "about", and make sure it stays distinct from the data representing the "paperweight" part of the book. To be fair, FluidDB appears to do this itself — and this terminology is my least-favorite part of FluidDB — by having single-purpose, permission-controlled "objects" just like every other system, but calling them "tags", and re-using the word "objects" to refer instead to what others might call a "UUID" or "central index". In Imaginary, the system is similar; although the centrality of the FluidDB "object" (in Imaginary's case, the "Thing") is less stark; using FluidDB's terminology, in Imaginary, a "tag" can have a "tag" of its own; in fact, there's nothing but tags ("Items") anywhere.
"Metadata is separated from the data it describes." This may be true in some systems, but the web is probably the system with the most data in it anywhere, and in that system, metadata is always available as part of the request and the response. You can put in any headers you want in the response, and there are lots of pieces of metadata (like content-type) which are almost always found along with the data. In my opinion, the problem is more that we don't have enough of the previous problem. Web developers haven't been obsessed enough with meaning: there aren't enough useful conventions around the HTTP request/response metadata, and so it's hard to bundle more metadata in with your response and have it faithfully propagated elsewhere. We don't know what arbitrary headers might mean, because we don't have any way of expressing a schema for them.
Terry says he's going to write more about these problems, and the solutions that FluidDB provides for them. I'm looking forward to it. As part of that, I'd really like to see a clear description of how these problems affect me, or someone I know, either as a programmer or as a user. What do I, or should I, really want to do with some application right now that these five problems are preventing me from doing?
The reason I felt compelled to write about this is that history — and particularly the history of websites like freshmeat and sourceforge — is littered with the corpses of projects which promised to fundamentally change the way we represent data. A common problem with these projects is that they have expansive denunciations of current techniques to represent data, or manage persistence, and claim to provide an advance so significant that they will displace all current applications. What most of the people working on these projects don't realize is that the current techniques for representing data have a history, and there are good reasons for their limitations. Granted, not all of those reasons are currently relevant, and many are examples of path dependence, but it's still important to understand the reasons in order to escape the problems.
In FluidDB's case, I think that the problem isn't so much that Terry doesn't have the historical perspective, but that he assumes that we all do. And that we can all make the cognitive leap to see why FluidDB is necessary. But if I can't do it, I have to assume there are at least a few other programmers who aren't getting the message either.

Thursday, September 24, 2009
Diesel: A Case Study In That Thing I Just Said
Since the reaction to my reaction to tornado was so good (or at least so ... energetic), I figure I should comment on Diesel as well. Spoiler alert: my reaction is ... largely similar, but since jamwt has been kind of nice to Twisted in the past, and didn't actually say anything mean this time, I'm somewhat reluctant to have that reaction. Nevertheless, I swore a solemn oath to tell it like it is, keep it real, and so forth. So I must.
Once again, I'm happy that event-driven programming is getting some love. This time, I'm pleased that nobody is saying anything especially snarky or FUD-ish about Twisted. I do feel like it's a little weird not to mention Twisted, or include some comparisons to Nevow or Orbited, both of which provide different, comprehensive approaches to COMET with Twisted.
(Worth noting: Orbited also originally started out using its own event-driven I/O layer, but switched to Twisted later, because Twisted is "crazy delicious".)
Diesel has many more interesting ideas at the level of async I/O than Tornado did. I think the generator-based approach for implementing protocols is interesting and deserves some more exploration. I'm not sold on it for every use-case, and I think the implementation might have some flaws, but it definitely has some advantages.
I'd give jamwt a hard time for not reporting issues and communicating with Twisted more before re-writing the core, but for three issues:
- jamwt's been around in the Twisted community for a while. He's written a bunch of fairly deep Twisted code and he clearly knows what the framework is capable of.
- I've spoken with him on a number of occasions, and for all I know I might have discussed this with him. I don't remember it, but it would be pretty embarrassing to write a big rant about how nobody talks to us only to have him paste some chat log where he explained why he was writing Diesel six months ago, and I said "oh, okay" ;-).
- Nobody is calling Twisted names or making vague, unsubstantiated accusations. You're not obligated to examine Twisted, nor Nevow, nor Orbited; I just feel that you owe us some explanation if you publicly say that you tried it and found it wanting. The tone on the Diesel announcement, in its one brief mention of Twisted, is "we tried it, but we kinda wanted to do our own thing". So, good for them, they did their own thing, I hope they had fun.
Before I launch into my critique, I should say that I don't want to harsh on Diesel too bad. It's a neat little hack and you should go play with it. And I feel bad pointing out problems with it, since as I mentioned above, nobody's dumping on Twisted. So, Diesel fans, please take this in the spirit of a frank code-review, not a complaint about your behavior.
The interesting generator-munging bits could be easily adapted to run on top of Twisted's loop, which, arguably, they should have been in the first place; and the toy "hub" that they've written might be good enough for some simple applications where reliability under load is not a serious concern. In fact, inlineCallbacks might provide a good deal of what is needed to support Diesel's programming style. Alternately, Diesel might provide some hints as to how things like inlineCallbacks could be made more efficient.
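As a rough illustration of why that adaptation looks feasible, here is a sketch of a generator-based protocol driven by a small line-buffering object. This is not Diesel's API and not Twisted's; `greeter` and `LineDriver` are hypothetical names, and `data_received` merely stands in for whatever method an event loop would call when bytes arrive.

```python
def greeter(send):
    """A protocol written as a generator: each bare `yield` waits for one line."""
    send(b"name?\r\n")
    name = yield
    send(b"hello, " + name + b"\r\n")
    while True:
        line = yield
        if line == b"quit":
            send(b"bye\r\n")
            return
        send(b"echo: " + line + b"\r\n")


class LineDriver:
    """Buffers raw bytes and resumes the generator once per complete line."""

    def __init__(self, handler):
        self.sent = []          # stands in for a transport's write buffer
        self.done = False
        self._buffer = b""
        self._gen = handler(self.sent.append)
        next(self._gen)         # run the handler up to its first yield

    def data_received(self, data):
        self._buffer += data
        while not self.done and b"\r\n" in self._buffer:
            line, self._buffer = self._buffer.split(b"\r\n", 1)
            try:
                self._gen.send(line)
            except StopIteration:
                self.done = True


if __name__ == "__main__":
    driver = LineDriver(greeter)
    driver.data_received(b"gly")            # partial line: nothing happens yet
    driver.data_received(b"ph\r\nquit\r\n")
    print(driver.sent)
```

Nothing in the driver cares which loop delivers the bytes, which is the point: the generator style lives entirely above the I/O layer.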
That said, Diesel's I/O loop sucks.
It's disappointing to see the same mistakes getting made over and over again. First and foremost: no tests. Come on, Python community! You can do better! Write your damn tests first!
The #1 benefit that a brand-new I/O loop project could have over Twisted is that Twisted was written in the bad old days before everybody knew that TDD was the right way to write programs, so we don't have 100% test coverage. But, we strive to get closer every day, while every new project decides that they don't need no stinking quality control.
Predictably, as it has no tests, Diesel's I/O layer is full of dead code, inaccurate documentation, and unhandled errors. Consider this gem, which I found about 30 seconds into reading the code: KqueueEventHub is documented to be "an epoll-based event hub", and its initializer defines an inner function which is never used. I'm not going to belabor the point by enumerating all the typo bugs I found, but you may find the output of 'pyflakes diesel' interesting.
Instead of Tornado's inaccurate handling of EINTR, Diesel has no handling of EINTR, as far as I can tell. It also doesn't handle EPERM, ENOBUFS, EMFILE, or even EAGAIN on accept(). To be fair, it has a catch-all exception handler all the way at the top of the stack, so none of these will cause instant crashes, but they will cause surprising behavior in odd situations (and possibly infinite traceback-spewing loops).
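For a sense of what handling these looks like, here is a minimal sketch of an accept step for a non-blocking listening socket. The set of "give up for now" errno values is an assumption built from the errors named above, and `accept_pending` is a hypothetical helper, not code from Diesel, Tornado, or Twisted. A real loop would also need to back off on EMFILE rather than being woken again immediately.

```python
import errno
import socket

# Errors after which this sketch stops accepting and returns to the loop.
# The exact membership is an assumption for illustration.
TRANSIENT_ACCEPT_ERRORS = {
    errno.EAGAIN,
    errno.EWOULDBLOCK,
    errno.EINTR,
    errno.EPERM,
    errno.ENOBUFS,
    errno.EMFILE,
}


def accept_pending(listener, limit=64):
    """Accept up to `limit` queued connections from a non-blocking listener."""
    accepted = []
    for _ in range(limit):
        try:
            conn, _addr = listener.accept()
        except OSError as e:
            if e.errno in TRANSIENT_ACCEPT_ERRORS:
                break   # nothing more to do right now; try again later
            raise       # anything else is a genuine bug or fatal condition
        conn.setblocking(False)
        accepted.append(conn)
    return accepted
```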
More surprisingly - I had to re-read the code about five times to make sure - it doesn't appear that sockets are ever set to be non-blocking, and EAGAIN is not handled from accept(), recv(), or send(). And yes, this can happen even if your multiplexor says your socket is ready for reading and/or writing. The conditions are somewhat obscure, but nevertheless they do happen. So, occasionally, Diesel will hiccup and block until some slow network client manages to send or receive some traffic. In other words: Diesel is not really async. It just fakes it convincingly, most of the time.
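The fix is not complicated, which is part of what makes its absence surprising. A minimal sketch, using a hypothetical `read_available` helper: put the socket in non-blocking mode, and treat "would block" as "nothing to read yet" rather than as an error.

```python
import socket


def read_available(sock, max_bytes=4096):
    """Read from a non-blocking socket.

    Returns the bytes read, b"" if the peer closed the connection,
    or None if the read would have blocked (EAGAIN / EWOULDBLOCK).
    """
    try:
        return sock.recv(max_bytes)
    except (BlockingIOError, InterruptedError):
        return None


if __name__ == "__main__":
    a, b = socket.socketpair()
    a.setblocking(False)         # the step that must not be forgotten
    print(read_available(a))     # nothing sent yet
    b.sendall(b"ping")
    print(read_available(a))
```

Without the `setblocking(False)` call, the first `recv()` here would simply hang until the other side sent something.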
Once again, there's no way to asynchronously spawn a process, and no way to asynchronously connect a TCP client. Sure, this looks like an asynchronous connect call, but it's misleading: it blocks on resolving the hostname, and it potentially blocks on the initial SYN/SYN+ACK/ACK exchange. There's no asynchronous SSL support. And no, that is not trivial. Not to mention handling all the crazy errors that spew out of the Windows TCP stack. And since the loop is implemented to be incompatible with Twisted, it's not obviously trivial to compatibly plug it in and get those features.
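For comparison, here is a sketch of the shape a non-blocking TCP connect usually takes at the socket level: start the connect, go back to the loop, and check `SO_ERROR` once the socket becomes writable. `start_connect` and `finish_connect` are hypothetical names, and the sketch deliberately takes a numeric IPv4 address so that it sidesteps name resolution, which is a separate blocking step that needs its own solution.

```python
import errno
import os
import select
import socket


def start_connect(host, port):
    """Begin a TCP connect without blocking; `host` must be a numeric IPv4 address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    err = sock.connect_ex((host, port))
    if err not in (0, errno.EINPROGRESS, errno.EWOULDBLOCK):
        sock.close()
        raise OSError(err, os.strerror(err))
    return sock


def finish_connect(sock):
    """Call once the loop reports `sock` writable; raises if the connect failed."""
    err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
    if err != 0:
        raise OSError(err, os.strerror(err))


if __name__ == "__main__":
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    sock = start_connect("127.0.0.1", listener.getsockname()[1])
    # A real event loop would register `sock` for writability instead.
    select.select([], [sock], [], 2.0)
    finish_connect(sock)
    print("connected to", sock.getpeername())
```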
Again, I don't want to dump on Diesel here; for what it is, i.e. an experiment in how to idiomatically structure asynchronous applications, it's all right. For that matter Twisted has its fair share of bugs too, which would be pretty easy to lay out in a similar post; you wouldn't even need to do the research yourself, just go look at our bug tracker.
But both Diesel and Tornado make the mistake of attempting to replace the years of trial-and-error, years of testing discipline, and years of portability and feature work that Twisted has accumulated with a few oversimplified, untested hacks.
What they could have done is contributed any extensions that they needed to Twisted's loop, or modifications to Twisted's packaging that would allow them to get a smaller sliver of Twisted's core to bootstrap, if that's what they needed.
My goal in pointing out all these flaws is not to illustrate any particular point about Diesel, but to reinforce the point I implicitly made in my Tornado post, which is that if you try to write a new mainloop (especially without tests) you will screw it up. You will most likely screw it up in ways which will only surface later, under mysterious circumstances, when your servers are under load and you are under the gun for a deadline.
Or if I happen to get wind of it and write a blog post about it, of course. Then you get to cheat a little.
It's not an indictment of Diesel that it screwed this up; everyone screws it up. I would probably screw it up, if I didn't have Twisted sitting in front of me as a direct reference. POSIX by itself is unreasonably subtle and difficult, but POSIX, plus the subtle variations in different platforms which implement it, plus the Windows APIs which are almost-but-not-quite-exactly-nothing-like the POSIX APIs, presents an inhuman challenge.
Hopefully Diesel will grow some tests. Hopefully it will fix, or better yet shed, its somewhat unfortunate I/O hub. I am hopeful that someone will follow Dustin's excellent lead (perhaps Dustin himself!) and port Diesel's API and generator system over to Twisted's I/O architecture and eliminate all these silly bugs. Of course, if someone did that, you could use Dustin's tornado port with Diesel.
With the silly bugs from the I/O loop out of the way, the Diesel team can write tests for the more interesting pieces, and fix the bugs which aren't entirely silly :-).

Making Twisted Specific
"pffft. twisted isn't specific."The original goal of the Twisted project, as I have been frequently reminded of late, is to create a general, inter-operable mainloop that isn't specific to any particular protocol. The main loop wasn't a goal in itself, as the point of making it general was to provide an opportunity for all protocols could have serious, production-quality implementations that any Twisted application could have access to. Twisted itself ships with many different protocol implemenations in furtherance of this goal, in an attempt to get critical mass.
— W. Allen Short
This generality is a great strength. It means that we've attracted a small crowd of generalists. We have an excellent development process, ever-increasing quality of both code and documentation, and a wide variety of different protocol implementations and libraries for doing common networking and inter-process communication tasks. We have recently been lucky to attract a few more excellent developers to help with this.
The one thing we haven't been so lucky about is attracting specifists. Although we still need more people to make Twisted awesome as a library, our community is getting better and better at doing that. What we need even more than that is individuals with a very specific, focused interest on just one thing that Twisted does. Czars, if you will, to push the development of Twisted as a suite of interoperating applications.
Twisted already has within it the seeds of excellent replacements for Apache httpd, OpenSSH, BIND, hybrid ircd, Sendmail, imapd, pop3d, and a few other servers, not to mention clients like Pidgin and the OpenSSH command-line client. In order to sprout and take root, those seeds each need a dedicated advocate, someone who cares deeply about the experience of a user or administrator who just wants Twisted to perform one particular function and doesn't want to write their own application code to make it do that.
Projects like the ones above - OpenSSH and BIND, for example - have an advantage in becoming useful: they have dedicated people who care deeply about satisfying a particular use-case, and are singularly focused on that case. Since they only have the one problem to worry about, they can give it a much more direct treatment.
However, given the team of infrastructure programmers already working on Twisted, such a focused individual would have an incredible force multiplier. Consider the statistics on Conch from our 2003 USENIX paper on Twisted: going just by line count, Conch was 4x easier to write than even J2SSH, which was itself substantially smaller than OpenSSH. It was 10x easier to write than OpenSSH. So, with the support of Twisted as infrastructure, one Twisted application programmer can do the work of ten merely mortal ones ;-).
It might seem to those of you looking to write a chat client, DNS server, or whatever open-source giant that you want to do battle with, that Twisted is just a library, and you want to write an application. But we really want Twisted to be a comprehensive suite of applications; we're just stretched too thin already to make it realize that potential.
So please rest assured that we would love to have your help with turning Twisted itself into a worthy competitor for these open-source giants - or, for that matter, if you want to build your own competitor as a layer on top of Twisted (for whatever reason: you love .ini files and we don't, you want a more freewheeling development process, or you want a different shade of green on your web pages) we'd still love to help you out and support that effort by fixing whatever issues you have with Twisted's core or protocols. There's even a super-project on Launchpad for Twisted-but-not-part-of-Twisted projects. I invite all you application developers out there to join that group and help us with world domination.
(If all that stuff about being ten times more effective as a programmer wasn't enough for you, how about this? On the Twisted Matrix Labs map of the post-revolutionary world, I'm pretty sure the Emancipated Territory of New Jersey is still missing an archduke and several viscounts. I can't make any promises, but if you get in on the ground floor of this thing there's still a chance you could be a ruling member of the Twisted over-government!)

Saturday, September 12, 2009
The Hole At The End Of The Pipe
While Mr. Resig isn't adamantly against "language abstractions" - he notes many of their benefits - his counterpoint is summed up in this paragraph:
In the case of these language abstractions you are gaining none of the benefit of learning the JavaScript language. When a leak in the abstraction occurs (and it will occur - just as it's bound to occur in any abstraction) what resources do you have, as a developer, to correct the problem? If you've learned nothing about JavaScript then you stand no chance in trying to repair, or work around, the issue.
This is becoming a popular fallacy in programming language circles; treating Joel Spolsky's "Law of Leaky Abstractions" as if it were an actual law.
Let's examine the metaphor of the "leak". In plumbing, a leak is a hole in a pipe where water gets out. Joel has noticed that every pipe has a hole in it, and therefore all pipes are leaky.
But that's not quite accurate. There's another hole in pipes where water gets out: it's called the "faucet", and without that part, the rest of the pipe is pretty useless. To say that a pipe whose faucet is turned on is "leaky" is somewhat misleading, just as it's misleading to say that an abstraction that propagates errors in its lower levels is leaky. Joel's entire original essay is based on a subtle (and, I suspect, intentional) misunderstanding of TCP: the error conditions that result from failures in the lower level, unreliable packet delivery mechanism are not leaks in the abstraction, they are very carefully specified and thoroughly documented. They are part of the abstraction. The abstraction of TCP does not try to pretend that connections are never broken, it just provides a unified idea of a "broken connection" that is clearly specified so you don't need to understand the five million ways that packet delivery can go wrong.
Put more simply: there are abstractions which do not leak. The example that Joel provides is one of them: TCP is a comprehensive abstraction.
Then there are abstractions which really do leak. Every object-relational mapper that provides a facility where you need to directly execute SQL, for example, is leaking the SQL through the abstraction. Every web templating framework where you can directly generate strings is leaky: the browser speaks DOM, and if you're generating strings, then bytes are leaking through the abstraction.
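To illustrate the templating case, here is a small sketch contrasting string concatenation with building a tree of elements. The standard library's ElementTree is standing in here for a DOM-like builder (my substitution, not a claim about any particular web framework), and both function names are made up.

```python
import xml.etree.ElementTree as ET


def render_comment_unsafe(text):
    """String-based rendering: whatever is in `text` becomes markup."""
    return "<p>" + text + "</p>"


def render_comment(text):
    """Tree-based rendering: `text` is data, and is escaped on serialization."""
    p = ET.Element("p")
    p.text = text
    return ET.tostring(p, encoding="unicode")


if __name__ == "__main__":
    hostile = "<script>alert(1)</script>"
    print(render_comment_unsafe(hostile))
    print(render_comment(hostile))
```

In the first function the bytes leak straight through; in the second, the caller never gets the chance to confuse text with structure.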
But "language abstractions" — or as those of us who are not hip to the new web lingo call them, "compilers" — are generally accepted to be the kind of thing that work well enough that you can trust them. I don't know the specifics of the current crop of javascript-targeting compilers. Maybe GWT and Pyjamas have issues that would require some knowledge of JavaScript to use them correctly. A well-written compiler, one that really lived up to the promise of treating the browser as a deployment target, wouldn't have those kinds of issues though. Let's turn the wayback machine to 1969 and cast Mr. Resig's argument against the contemporary contender for moving up the abstraction stack:
In the case of UNIX, you are gaining none of the benefit of learning the PDP-11 instruction set. When a bug in the C compiler occurs (and it will occur - just as it's bound to occur in any compiler) what resources do you have, as a developer, to correct the problem? If you've learned nothing about PDP-11 assembler then you stand no chance in trying to repair, or work around, the issue.
So, for those of you who work on UNIX-like operating systems using that fancy "C" machine-code abstraction: how much PDP-11 assembler have you written recently?

Tornado + Twisted
(The method it uses is currently a little weird, where you create a "Site" object, but it looks like it would be pretty simple to use a Resource instead if you were so inclined.)

What I Wish Tornado Were
Let me start with the good stuff. First of all, I think it's great that we have yet another asynchronous contender in the Python world. Every time something like this comes out, it means that Twisted has to fight that much less hard to get over the huge hump of event-driven programming being too hard, or too weird, or whatever. It's good to have an endorsement of the general message "if you need a web server to handle COMET requests, it needs to be asynchronous to perform acceptably" from such a high-profile company as Facebook.
Unfortunately I think the larger picture here is a failure of communication in the open source community. In the course of developing Tornado, there are several things that FriendFeed could have done to move the Twisted community forward, at no cost to themselves. I don't want to rag on FriendFeed, or Bret Taylor, or Facebook here; they're not the first to re-write something without communicating. In fact I recently had almost this exact same discussion with another project that did the same thing. Since Tornado is such a high-profile example, though, I want to draw attention to the problem so that there's some hope that maybe the next project won't forget to communicate first.
My main point here is that if you're about to undergo a re-write of a major project because it didn't meet some requirements that you had, please tell the project that you are rewriting what you are doing. In the best case scenario, someone involved with that project will say, "Oh, you've misunderstood the documentation, actually it does do that". In the worst case, you go ahead with your rewrite anyway, but there is some hope that you might be able to cooperate in the future, as the project gradually evolves to meet your requirements. Somewhere in the middle, you might be able to contribute a few small fixes rather than re-implementing the whole thing and maintaining it yourself.
This is especially important if you are later going to make claims about that project not living up to your vaguely-described requirements, and thereby damage its reputation. Bret Taylor claims in his blog:
We ended up writing our own web server and framework after looking at existing servers and tools like Twisted because none matched both our performance requirements and our ease-of-use requirements.
First and foremost, it would have been great to hear from Bret when he started off using Twisted about any performance problems or ease-of-use problems. I'm guessing that Twisted itself had only ease-of-use problems, and other "tools like Twisted" were the ones with performance problems, since later, in a comment on the same post, he says:
I can't imagine there is much of a performance difference [between Twisted Web and Tornado]. The bottom is not that complex in my opinion.
It would also be great if he had explicitly said that Twisted didn't have performance problems rather than making me guess, because I'm sure that is what lots of developers will take away from this. When you have the bully pulpit, off-the-cuff comments like this can do serious damage to smaller projects.
More to the point, what is the problem with "ease of use", exactly? The fact that he found Deferred tedious, in particular, seems very strange to me, given that it is so un-tedious that it has become a de-facto standard even in the JavaScript community. We had no opportunity to help him or anyone else out, because as far as I can tell from searching our archives, we never heard from him or from anyone else at FriendFeed when they were trying out Twisted at first. Even as he's saying that Twisted is hard to use and (maybe?) performs poorly, he isn't pointing to any particular example of what about it is hard to use, or what performs poorly. There's still nothing we can do to address this criticism. And there's still not much we can do to make sure that future potential Twisted users won't have this problem.
Later, in yet another comment, Bret points out the root problem:
This is true. However, as I frequently like to note, Twisted is starved for resources. Reconciling the chaos described on the page about web development with Twisted is an ongoing process. For a tiny fraction of the effort invested in Tornado, FriendFeed could have worked with us to resolve many of the issues creating that chaos.
This is the main thing I want to reinforce here. If half a dozen occasional contributors with a real focused interest in web development showed up to help us on Twisted, we'd have an awesome, polished web story within a few months. If even one person really took responsibility for twisted.web, things would pick up. But if everyone who wants an asynchronous webserver either uses twisted.web (because it's great!) without talking to us or decides not to use it (because it doesn't meet their unstated requirements) without talking to us, it's going to continue to improve at the same sluggish pace.
Even at the current rate, by the time we have an excellent HTTP story, I somehow doubt that Tornado will have a good SSHv2 protocol story ;-).
In his comment, Bret also takes a couple of pot-shots at Twisted that I think are unnecessary, and I'd like to address those too.
In general, it seems like Twisted is full of demo-quality stuff, but most of the protocols have tons of bugs.
We're not talking about "most" of the protocols here, Tornado is only concerned with HTTP. And the HTTP implementation(s) in Twisted do not have "tons of bugs". They are production quality, used on lots of different websites, and have lots of automated tests. While much of the code in twisted.web doesn't have complete test coverage, since it's old enough to predate our testing requirements, I note that Tornado appears to have zero test coverage.
There's a kernel of truth here (some of the older, less frequently used protocols have a few problems), but in most cases the "bugs" are really just a lack of functionality. Twisted overall has very few protocol-related bugs, and again, our test policy makes sure that we get new bugs very rarely.
Given all those factors, it didn't seem to provide a lot of value. Our core I/O loop is actually pretty small and simple, and I think resulted in fewer bugs than would have come up if we had used Twisted.
I must respectfully disagree. Again, I don't want to rag on FriendFeed here, but here are several features that Tornado would have, and bugs that it wouldn't have, if it used Twisted for the event loop and none of the HTTP stuff:
- EINTR wouldn't cause your application to exit if run in a non-US-english locale.
- You don't have the opportunity to forget to set a socket to be non-blocking and thereby make your entire application stop.
- It would be possible to run your application on Windows.
- Firewalled connections and running out of file descriptors wouldn't cause your server to spew errors forever (at least, it won't any more).
- You could write a TCP client that didn't block for an arbitrary amount of time in connect().
- Finally, of course, you could use all of Twisted's other protocols, client and server: IMAP, POP, SMTP, IRC, AIM, etc. You could also use external protocol implementations like Thrift.
- You could spawn asynchronous subprocesses.
This list is a great example of why projects like Tornado really should use Twisted. Tornado implements some innovative web-framework stuff, but absolutely nothing interesting that I can see at the level of async I/O. Using Twisted would have allowed them to focus exclusively on cool web things and left the never-ending stream of incremental surprising platform-specific, only-happens-in-weird-situations bugfixes to a single, common source.
What To Do Now
I hope that someone at FriendFeed will be a little heavier on detail and a little lighter on FUD in some future conversation about Twisted. However, I'm sure they're going to have their hands full maintaining their own code, so I don't have high expectations in this area. I'm sure Bret wasn't intentionally slamming Twisted, either; it wasn't like he wrote a big screed about it, he just dropped a few unsubstantiated comments into a much larger post about Tornado. So I just want to be clear: I don't have sore feelings, I don't need anybody to apologize to me or to Twisted.
If any of you out there are fans of both Tornado and of Twisted, it would be great if you could contribute a patch to Tornado which would allow it to at least optionally use Twisted as an I/O back-end. It would be great, of course, if lots of people interested in web stuff would help us out with our web situation, but supporting the Twisted event loop would be good regardless. It would mean that when people wanted to speak multiple protocols, they wouldn't need to re-write or kludge in their existing Tornado application, so it would increase the chances that we could get some help with our SSH, FTP, IRC, or XMPP code instead. It would also open up a much wider multi-protocol landscape to users of Tornado, even if Tornado's default mode of operation still used ioloop.py.
Even better would be to hook up something that made a Tornado IResource implementation, so that Tornado applications and twisted.web and Nevow applications could all be seamlessly integrated into one server.
The whole point of Twisted is to have a common I/O layer that lots of different libraries can use, share, and build on, so that we can solidify the common and highly complex abstraction required of a comprehensive, cross-platform, event-driven I/O layer. In order to realize that vision, we need help not just with the code; we need more Twisted ambassadors to go out into the community and help us integrate these disparate applications, help us find out where real users are finding the documentation inadequate or the organization confusing.
Tornado could be an excellent opportunity for those ambassadors to go out and introduce others to the wonders of Twisted, because its endorsement from FriendFeed guarantees it an audience of tens of thousands of developers, at least for its first few months of life. If you've shied away from contributing to Twisted itself because of our aggressive testing and documentation requirements, well, Tornado apparently doesn't have any, so it would be a great place for you to start :).

Friday, September 11, 2009
The Web, Untangled
First, let me tell you what the answer isn't. It isn't a continuation of the traditional "web framework" strategy. These have been important tools in dredging the conceptual mire of the web for useful patterns, and at this point in history they have a long life ahead of them. I'm not predicting the death of Django or Rails any time soon. Django and Rails are the stucco of the web. An important architectural innovation, to be sure: they let you cover over the materials underneath, allowing you to build structures that are appealing without fundamentally changing the necessarily ugly underpinnings. But you can't build a skyscraper out of stucco.
As Jacob covered in great detail in his talk, innovations in the "framework" space generally involve building more and more abstractions, creating more and more new concepts to simplify the underlying concepts. Eventually you run out of brain-space for new concepts, though, and you have to start over.
I started here by saying that we're stuck with the web. If we can understand why we're stuck with the web, we can make it a pleasant place to be. Of course everybody has their own ideas about what makes the web great, but it's important to remember that none of that is what makes the web necessary.
What makes the web necessary is very simple: a web browser is a Turing-complete execution environment, and everyone has one. It's also got a feature-complete (if highly idiosyncratic) widget set, so you can display text, images, buttons, scrollbars, and menus, and compose your own widgets (sort of). Most importantly, it executes code without prompting the user, which means the barrier to adoption of new applications is at zero. Not to mention that, thanks to the huge ecosystem of existing applications, the user is probably already running a web browser.
I feel it's important to emphasize this point. When developing an application, delivery is king. It doesn't matter how great your application is if no users ever run it, and given how incredibly cheap in terms of user effort it is to run an application in a web browser, your application has to be really, really awesome to get them to do more work than clicking on a link. I can't find the article, but I believe Three Rings once did an interview where they explained that some huge percentage of users (if I remember correctly, something like 90%) will leave immediately if you make them click on a "download" link to play the game, but they'll stick around if you can manage to keep it in the browser without making them download a plugin.
Improvements to ECMAScript and HTML sound fun, but if, tomorrow morning, somebody figured out how to securely execute x86 machine code on web browsers, and distribute that capability to every browser on the internet, developers would start using that almost immediately. HTML-based applications would slowly die out, as their UIs would be comparatively slow, clunky, and limited.
Tools like the Google Web Toolkit (and Pyjamas, its Python clone), recognized this fact early on. They treat the browser as what the browser should be: a dumb run-time. A deployment target, not a development environment. Seen in this light, it's possible to create layers for integration and inter-op above the complexity soup of DOM and JavaScript: despite the fact that the browser itself has no "linker" to speak of, and no direct support for library code, with GWT you get Java's library mechanism.
Although it's not particularly well-maintained, PyPy also has a JavaScript back-end, which allows you to run a restricted subset of Python ("RPython") in a web browser; I hope that in the future this will be expanded to give us a more realistic, full-featured Python VM in the browser than Pyjamas' fairly simplistic translation currently does. In opposition to the "worrying trend" that Jacob noted, with individual applications needing to write new, custom run-times, they leverage an existing language ecosystem rather than inventing something new.
Using tools like these, you can write code in the same language client-side and server-side. This simplifies testing. You can at least get basic test coverage in one pass, in one runtime, even if some of that code will actually run in a different runtime later. It simplifies implementation and maintenance, too. You can write functions and then decide to run them somewhere else based on deployment, security, or performance concerns without necessarily rewriting them from scratch.
If toolkits like these gained more traction, it would go a long way towards interop, too. It would be a lot easier to have an FFI between Python-in-the-browser and Java-in-the-browser than to try to wrangle every possible JavaScript hack in the book. Similarly on the server side: once a few frameworks can standardize on rich client-server communication channels, it will be easier to have a high-level abstraction over those than over the mess of XMLHttpRequest and its various work-alikes.
There's still an important component missing, though. Web applications almost always have 3 tiers. I've already discussed what should happen on the first tier, the browser. And, as GWT, NaCl and Pyjamas indicate, there are folks already hard at work on that. The middle tier is basically already okay; server-side frameworks allow you to work with "business logic" in a fairly sane manner. What about the database tier?
The most common complaint about the database tier is security. Since half the time your middle tier needs to be generating strings of SQL to send to the database, there are a plethora of situations where an accidental side-channel is created, allowing users to directly access the database.
This is a much more tractable problem than the front-end problem. For one thing, a really well-written framework, one which doesn't encourage you to execute SQL directly, can comprehensively deal with the security issue. Similarly, a good ORM will allow you complete access to the useful features of your database without forcing you to write code in two different programming languages.
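To make the side-channel concrete, here is a minimal sketch using Python's standard sqlite3 module. The account table and the find_unsafe / find_safe helpers are made up for this illustration and don't come from any particular framework. The first helper pastes user input into the SQL text; the second hands it to the database as a bound parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (owner TEXT, balance_cents INTEGER)")
conn.executemany(
    "INSERT INTO account VALUES (?, ?)",
    [("alice", 1000), ("bob", 2500)],
)


def find_unsafe(owner):
    # Hypothetical helper: the input becomes part of the SQL text itself,
    # so a crafted value can rewrite the WHERE clause.
    sql = "SELECT owner FROM account WHERE owner = '%s'" % owner
    return [row[0] for row in conn.execute(sql)]


def find_safe(owner):
    # Hypothetical helper: the input travels separately as data and
    # cannot change the structure of the query.
    sql = "SELECT owner FROM account WHERE owner = ?"
    return [row[0] for row in conn.execute(sql, (owner,))]


hostile = "nobody' OR '1'='1"
leaked = find_unsafe(hostile)      # matches every row in the table
contained = find_safe(hostile)     # matches nothing; no such owner exists
```

A framework that only ever builds queries the second way closes this particular hole without the application author having to think about it.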
Still, there's a huge amount of wasted effort on the database side of things. Pretty much every major database system has a sophisticated permission system that nobody really uses. If you want to write stored procedures, triggers, or constraints in a language like Python, it is at worst impossible and at best completely non-standard and very confusing. Finally, if you want to test anything... you're not entirely on your own, but it's going to be quite a bit harder than testing your middle-tier code.
One part of the solution to this problem comes, oddly enough, from Microsoft: LINQ, the Language Integrated Query component, provides a standard syntax and runtime model for queries executed in multiple different languages. More than providing a nice wrapper over database queries, it allows you to use the same query over in-memory objects with no "database engine" to speak of. So you can write and test your LINQ code in such a way that you don't need to talk to a database. When you hook it up to a database, your application code doesn't even really need to know.
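This is not LINQ, but the idea can be roughly sketched in Python. Everything here is an assumption made up for the illustration: the Where class, the run_in_memory and run_in_sqlite functions, and the item table. One query object is evaluated either against plain dictionaries or against a SQLite table, and the code that builds the query doesn't care which.

```python
import operator
import sqlite3
from dataclasses import dataclass

# Hypothetical mini query language: one comparison against one column.
PYTHON_OPS = {">=": operator.ge, "<": operator.lt, "==": operator.eq}
SQL_OPS = {">=": ">=", "<": "<", "==": "="}
COLUMNS = ("name", "price_cents")  # the only columns a query may mention


@dataclass(frozen=True)
class Where:
    column: str
    op: str
    value: object

    def __post_init__(self):
        # Reject anything outside the whitelists, so the SQL text below is
        # only ever assembled from known-safe pieces.
        if self.column not in COLUMNS or self.op not in SQL_OPS:
            raise ValueError("unsupported query: %r" % (self,))


def run_in_memory(query, rows):
    """Evaluate the query against a list of dicts; no database involved."""
    test = PYTHON_OPS[query.op]
    return [row["name"] for row in rows if test(row[query.column], query.value)]


def run_in_sqlite(query, conn):
    """Evaluate the same query object against an 'item' table."""
    sql = "SELECT name FROM item WHERE %s %s ? ORDER BY rowid" % (
        query.column, SQL_OPS[query.op])
    return [name for (name,) in conn.execute(sql, (query.value,))]


items = [
    {"name": "pen", "price_cents": 150},
    {"name": "stapler", "price_cents": 1299},
    {"name": "desk", "price_cents": 8999},
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (name TEXT, price_cents INTEGER)")
conn.executemany("INSERT INTO item VALUES (:name, :price_cents)", items)

pricey = Where("price_cents", ">=", 1000)
from_memory = run_in_memory(pricey, items)
from_database = run_in_sqlite(pricey, conn)
```

Tests can exercise run_in_memory alone, with no database in sight, and a separate check can confirm that the SQLite back-end agrees with it.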
The other part of the solution comes from SQLite. Right now, managing the deployment of and connection to a database is a hugely complex problem. You have to install the software, write some config files, initialize your database, grant permissions to your application user, somehow get credentials from the database to the application, connect from the application to the database, and verify that the database's schema is the same as what the application expects. And that's before you can even do anything! Once you're up and running, you need to manage upgrades and schedule downtime for updating the database software (independently of upgrading the application software). Note that the database can't be a complete solution for the application's persistence needs, either, because in order to tell the application where it needs to find the rest of its data, you need, at the very least, a hostname, username, and password for the database server.
All of this makes testing more difficult - with all those manual steps, how can you really know if your production configuration is the same as your test configuration? It also makes development more difficult: if automatically spinning up a new database instance is hard, then you end up with a slightly-nonstandard manual database setup for each developer. With SQLite, you can just say "database, please!" from your application code, specifying all the interesting configuration right there.
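As a rough sketch of what "database, please!" can look like with Python's standard sqlite3 module: the open_database helper, the sale table, and the SCHEMA_VERSION constant are all assumptions invented for this example. The application creates the file, installs the schema, and checks the schema version itself, using SQLite's user_version pragma as the stamp.

```python
import os
import sqlite3
import tempfile

SCHEMA_VERSION = 1  # hypothetical version number owned by the application


def open_database(path):
    """Open the application's database file, creating it if needed."""
    conn = sqlite3.connect(path)
    (found,) = conn.execute("PRAGMA user_version").fetchone()
    if found == 0:
        # Brand-new file: create the schema and stamp it with our version.
        conn.execute("CREATE TABLE sale (item TEXT, price_cents INTEGER)")
        conn.execute("PRAGMA user_version = %d" % SCHEMA_VERSION)
        conn.commit()
    elif found != SCHEMA_VERSION:
        conn.close()
        raise RuntimeError(
            "schema version %d, expected %d" % (found, SCHEMA_VERSION))
    return conn


path = os.path.join(tempfile.mkdtemp(), "app.sqlite")

first = open_database(path)            # creates the file and the schema
first.execute("INSERT INTO sale VALUES ('pen', 150)")
first.commit()
first.close()

second = open_database(path)           # same call works on an existing file
rows = second.execute("SELECT item, price_cents FROM sale").fetchall()
```

No server to install, no credentials to pass around; a test can point the same function at a temporary directory and get an identical setup every time.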
Finally, SQLite allows you to very easily write stored procedures and triggers in your "native" language. You also don't need them quite as much, because your application can much more easily control all access to its database, but if you want to work in the relational model it's fairly simple. The stored procedures are just in memory, and are called like regular functions, not in an obscure embedded database environment.
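For a small illustration, Python's standard sqlite3 module lets you register an ordinary Python function so that SQL can call it, via Connection.create_function. Strictly speaking that gives you a user-defined SQL function, not a stored procedure in the traditional sense, though triggers and queries can call it. The discounted_cents function and the sale table below are made up for this sketch.

```python
import sqlite3
from decimal import Decimal, ROUND_HALF_UP


def discounted_cents(price_cents, percent_off):
    """Ordinary Python function; also registered with SQLite below."""
    exact = Decimal(price_cents) * (100 - percent_off) / 100
    # Round half-up to a whole number of cents.
    return int(exact.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


conn = sqlite3.connect(":memory:")
# Make the Python function callable from SQL as a two-argument function.
conn.create_function("discounted_cents", 2, discounted_cents)

conn.execute("CREATE TABLE sale (item TEXT, price_cents INTEGER)")
conn.executemany(
    "INSERT INTO sale VALUES (?, ?)", [("pen", 150), ("desk", 8999)])

rows = conn.execute(
    "SELECT item, discounted_cents(price_cents, 15) FROM sale ORDER BY item"
).fetchall()
```

The same function object is used by the SQL query and by any Python caller, so there is only one copy of the discount logic to test.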
In other words, for modern web applications, a database engine is really just a library. The easier it is to treat it like one, the easier it is to deploy and manage your application.
In the framework of the future, I believe you'll be able to write UI code in Python, model code in Python, and data-manipulation code in Python. When you need to round a number to two digits, you'll just round it off, and it'll come out right.

Thursday, September 10, 2009
Oh <what> a.tangled {web, we} WEAVE FROM
What sucks about web development? How will we fix it? How has Python fixed it, and how will Python fix it in the future? While I can't say I agree with every answer, I found myself nodding quite a bit, and he has something useful to say on just about every point.
I noticed one very important question he leaves out of the mix, though, which seems more fundamental than the others: why does web development suck? In particular, why do so many people who are familiar with multiple styles of development feel like developing for the web is particularly painful by comparison, while so much of software development moves to the web? And, why does web development in Python suck, despite the fact that otherwise, Python mostly rocks?
Programming for the web lacks an important component, one that Fred Brooks identified as crucial for all software as early as 1975: conceptual integrity. Put more simply, it is difficult to make sense of "web" programs. They're difficult to read, difficult to write and difficult to modify, because none of the pieces fits together in a way which can be understood using a simple conceptual model.
Rather than approach this head on, from the perspective of a working web programmer, let's start earlier than that. Let's say someone approached you with a simple programming task: write an accounting system that includes point-of-sale software to run a small business. Now, considering some imagined requirements for such a system, how many languages would you recommend that it be written in?
Most working programmers would usually say "one" without a second thought. A too-clever-by-half language nerd might instead answer "two, a general-purpose programming language for most things and a domain specific language to describe accounting rules and promotions for the business". Why this number? Simply put, there's no reason to use more, and introducing additional languages means mastering additional skills and becoming familiar with additional quirks, all of which add to initial development time and maintenance overhead. Modern programming languages are powerful enough to perform lots of different types of tasks, and are portable across both different computer architectures and different operating systems, so other concerns rarely intrude.
But, in the practical, working programmer's world, what's the web's answer to this question? Six. You have to learn six languages to work on the web:
- HTML. This isn't really a programming language, but in web development you do end up reading and writing quite a lot of it.
- CSS. In order to apply visual styles to your HTML so that it actually looks nice in a browser, you need to understand a different language (with a different conceptual model for how documents are laid out than the HTML itself).
- JavaScript. In today's competitive AJAX-y world, you need to be able to react instantly in the browser, writing a real client application.
- SQL, so that you can store your data in a database.
- Your "middle-tier" language: in my case and Jacob's, that would be Python. This is where people tend to spend the bulk of their programming time, but not all of it.
- A templating language; in Jacob's case, the Django template language.
Of course, Jacob lists a pile of related technologies too, and rightly points out that it's a lot to keep in your head. But he is talking about a problem of needing extensive technical knowledge, something which all programmers working in a particular technology ecosystem learn sooner or later. I'm talking about a different, more fundamental problem: in addition to the surface problem of being complex and often broken, these technologies are fundamentally conceptually incompatible, which leads to a whole host of other problems. Furthermore, the only component which is really complete is the "middle-tier" language, although bespoke web-only languages like PHP and Arc manage to screw that up too.
Here are a few simple example problems that are made depressingly complex by the impedance mismatch between two of these components, but which are incredibly easy using a different paradigm.
How do you place two boxes with text in them side-by-side? Using a GUI toolkit, like my favorite PyGTK, it often goes something like this:
from gtk import HBox, Label  # PyGTK 2.x
left = Label("some text")
right = Label("some other text")
box = HBox()
box.add(left)
box.add(right)
The conceptual model here is simple: the HBox() is a container, the "left" and "right" things are widgets, which are in that container. You can add them, remove them, swap them, or handle events on them easily. You can discover how these things are done by reading the API references for the appropriate classes of object. However, there's no right answer to this question on the web. You can use a <table> tag, and then some <tr>s and <td>s to make a single-row table with two cells, but that has a variety of limitations; plus, it's considered somehow gauche by most web designers to use tables for layout these days. Or, you could cook up a collection of CSS classes. So there's the first impedance mismatch: do you do layout in HTML, or CSS? Of course most design gurus would like to tell you that "always and only CSS" is the right answer here, but more practically-minded web developers who actually write code will often prefer HTML, partially because it's simpler but partially because CSS's featureset is incomplete and there are some things you can still only do with HTML, or only do portably with HTML.
Plus, how do you discover how these layouts work? There are a variety of reference materials, but no canonical guide that says "this is exactly what a <table> tag should do, and how it should look". There are different forms of documentation for both.
If you have a variable number of elements, you quickly run into another problem. Should this be the responsibility of the HTML, the CSS, or some code (in the templating layer) that emits some HTML or some CSS? Should the code in the templating layer be written as an invocation of your middle-tier language, or should the template language itself have some code in it? Reasonable people of good conscience disagree with each other in every possible way over every one of these details.
This is all part of a very complex problem though. For all of these crazy hoops you have to jump through, HTML and CSS do provide a layout model that allows you to do some very pretty and very flexible things with layout, especially if you have large amounts of text. Perhaps not as good as even the most basic pre-press layout engine, but still better than the built-in stuff that most GUI toolkits allow you. So there is an argument that this complexity is a trade-off, where you get functionality in exchange for the confusion. So let's look at a much simpler problem.
Let's say that, in our hypothetical accounting application, you have a list of items in a retail transaction, and you want to process the list and produce a sum. Where is the right place to do that? It turns out you have to write the code to do that three times.
First, you have to write it in JavaScript. After all, the numbers are all already in the client / browser, and you want to update the page instantaneously, not wait for some potentially heavily-loaded server to get back to you each time the user presses a keystroke. And why not? You've got plenty of processing power available on the client.
Then you have to write it in Python. That's where the real brain of the application lives, after all, and if you're going to do something like send a job to a receipt printer or email a customer or sales representative some information in response to a sale, the number has to be located in the middle tier.
Finally you have to do it in SQL. Since this is a traditional web application, your Python code is going to be spread out among multiple servers, and the database is the ultimate arbiter of recorded truth. So you need to have transactions around the appropriate points and execute any interesting aggregate functions (such as SUM()) in the database tier.
So, you've got three times as much work to do in your fancy new web application as you would in a simple record-based application with a GUI. A worthy price to pay to run in the brave new world of tomorrow rather than on some crusty old client/server system, right?
Well, as it turns out, the problem is somewhat deeper than that. It turns out that JavaScript, Python, and SQL actually have slightly different numerical models (in fact Python implements at least 4 itself: fixed-point decimal, floating-point decimal, IEEE 754 floating-point binary, and integer math; you should really only use decimal for money, but this isn't available in JavaScript and its availability in SQL is spotty). After applying some discounts, your register might read $19.74 but your receipt will read $19.75; and the reports sent to the accounting department will read $19.74898989898989.
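The difference between those numerical models is easy to see from inside Python alone. This is a small sketch with made-up amounts: Python's float is the same IEEE 754 binary format that JavaScript numbers use, while the decimal module does base-10 arithmetic with an explicit rounding rule.

```python
from decimal import Decimal, ROUND_HALF_UP

# IEEE 754 binary floating point: ten cents and twenty cents have no exact
# binary representation, so their sum is not exactly thirty cents.
float_total = 0.1 + 0.2
float_is_exact = (float_total == 0.3)

# 2.675 is stored as a value slightly below 2.675, so it rounds down.
float_rounded = round(2.675, 2)

# Decimal arithmetic: values are exact in base 10 and the rounding mode
# is chosen explicitly.
decimal_total = Decimal("0.10") + Decimal("0.20")
decimal_rounded = Decimal("2.675").quantize(
    Decimal("0.01"), rounding=ROUND_HALF_UP)
```

Here float_rounded comes out as 2.67 while decimal_rounded is 2.68: a one-cent disagreement between two tiers that each believe they implemented the same rule.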
Even if you know a lot about math on computers, the limitations of each of these runtimes, and you happen to get all of that just right, you still have another problem to contend with: what happens when somebody else needs to change the logic in question? How do you test that the Python, the JavaScript, and the SQL are all still in sync? It's possible, but you have to go above and beyond the usual discipline of test-driven development, because you need to have integration tests that verify that different, almost unrelated code, in different languages, in different environments is all executing properly in lock-step. Just getting the code from SQL and JavaScript to run in your Python test suite at all is a major challenge; in a language like PHP it's borderline impossible.
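Here is a minimal sketch of the kind of cross-tier test this implies, covering only the Python and SQL copies of the sum. The line_item table, the prices, and the total_in_python / total_in_sql functions are invented for the example, and SQLite stands in for whatever database the real application uses. The JavaScript copy is out of reach from this test suite entirely, which is rather the point.

```python
import sqlite3
import unittest

LINE_ITEMS_CENTS = [1999, 250, 75]  # made-up prices, kept as integer cents


def total_in_python(prices):
    """The middle-tier copy of the logic."""
    return sum(prices)


def total_in_sql(conn):
    """The database-tier copy of the logic."""
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(price_cents), 0) FROM line_item").fetchone()
    return total


class TotalsAgree(unittest.TestCase):
    def setUp(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE line_item (price_cents INTEGER)")

    def tearDown(self):
        self.conn.close()

    def load(self, prices):
        self.conn.executemany(
            "INSERT INTO line_item VALUES (?)", [(p,) for p in prices])

    def test_python_and_sql_agree(self):
        self.load(LINE_ITEMS_CENTS)
        self.assertEqual(
            total_in_python(LINE_ITEMS_CENTS), total_in_sql(self.conn))

    def test_empty_transaction(self):
        # SUM() over no rows is NULL in SQL but sum([]) is 0 in Python;
        # the COALESCE above exists only to paper over that mismatch.
        self.assertEqual(total_in_python([]), total_in_sql(self.conn))
```

Saved in a module, this runs with python -m unittest. Even in this toy version, the empty case needs special handling just to make two languages agree on what "the sum of nothing" is.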
This is all even worse when it comes to security, because every part of the application exposes an attack surface, and because you can't use the same language or the same libraries to do any of the work, they all expose a different attack surface.
In his talk, Jacob notes that "frameworks suck at inter-op", but the problem is much deeper than that. As I've shown here, a single page from a single application written using a single framework, which has only one task to do, can't even inter-operate with itself cleanly, at least not at the level that Jacob wants — or that I want. He says, "gateways aren't APIs", and he's right: the correct way to inter-operate is through well-defined APIs. APIs can be discovered through a single, consistent process. Their implementations can be debugged using a single set of development tools.
CSS isn't an API. HTML isn't an API. Strings containing a hodgepodge of SQL and data aren't an API either.
It's not all doom and gloom, but my ideas for a future solution to this problem will have to wait for another post.

