Deciphering Glyph

A collection of articles, ideas, and rambling from a guy who wrote some software that one time.

Wednesday, January 06, 2010

Some Common Onomatological Errors

The open-source event-driven networking engine that I work on is called "Twisted".  If you're uncomfortable using something that sounds like an adjective in a place where a noun should go, the following noun phrases are equivalent:
  1. the Twisted project
  2. the Twisted engine
  3. the Twisted networking engine
  4. the Twisted framework
The unofficial group (of which I am a member) which works on that software is known as "Twisted Matrix Laboratories", sometimes shortened to "Twisted Matrix Labs" or "TMLabs".

I can understand that there is some confusion around this stuff, since these words often appear in close proximity, but to my knowledge there is nothing called "Python Twisted", "Twisted Python", or "Twisted Matrix".  There's "python-twisted", which is the package name that some operating systems use to package Twisted.  There is also "twisted.python", which is a python  package within Twisted itself.  Finally there is "twisted-python@twistedmatrix.com", which is the mailing list for discussing Twisted stuff in the Python programming language.  (This discussion list is so named to distinguish it from the possibility of not-quite-hypothetical discussion of Twisted implemented in other languages, although no other implementations are currently actively maintained.)

I just thought you'd all like to know that.  That is all.  (For now, anyway.)

Saturday, October 24, 2009

Learn Twisted

Jean-Paul Calderone continues his excellent "Twisted Web In 60 Seconds" tutorial series.  If you haven't checked it out yet, you should!

Friday, October 23, 2009

Do you want WiFi to work at your conference?

I've been pretty busy for the last couple of weeks, so I've just had an opportunity to catch up with blog posts that have been piling up.  In particular I noticed this one: The “WiFi At Conferences” Problem, by Joel Spolsky.

Joel has a lot of what look like good recommendations.  However, I can provide a much-abridged list.

Some years, WiFi access at PyCon US has been provided by the venue, or by a contractor whose name I mercifully do not know.  Those years, it has not worked.  Some years, it has been provided, or at least managed, by tummy.com.  Those years, it has worked.  They are probably much more critical of their own efforts than I am, as you can see in this thorough write-up that they did of PyCon's 2008 WiFi situation.

My two-step plan for you if you want your conference to have working WiFi access at your conference is:
  1. e-mail somebody at tummy.com, telling them that you want a working wireless network, and
  2. give them whatever they ask for.
If you do these things, then when people open their laptops at your conference, their networks will work.

Sunday, October 04, 2009

Hobgoblin History

I like Terry Jones; I think FluidDB has a lot of potential.  But, sometimes when he's talking about it, he gets a little carried away and forgets that the rest of us don't live in his future yet.  In his latest missive on the official FluidDB blog, "Digital Hobgoblins", he describes some of the problems that FluidDB sets out to solve.

The problem is, I already have solutions for all of these problems, and I don't quite understand why they don't (or shouldn't) work for me.  (Since he organizes the post in terms of problems that existing systems have, I'm going to take the liberty of re-labeling these in terms of the problems that he seems to be describing rather than the lead text he used.  Please post a comment if you think my labeling is wrong.)

In existing systems, Terry says:

"Things must be named, and have one name."  Specifically, Terry calls out file systems.  Except... file systems have lots of ways of introducing multiple names for the same thing.  Symbolic links.  Hard links, if you really want to allow for ambiguity.  If you want to track that ambiguity, Windows "shortcuts" and MacOS "aliases" can do that.  Overlay mounts, loopback mounts and chroot execution allow for semi-arbitrary renaming.  Lots of other systems support this, too.  Database systems have a specific provision for multiple names: the many-to-one relation.  Any programming language with pass-by-reference data structures allows for some level of multiple-naming.  In fact, there's a whole discipline for allowing things to have lots of different names: indexing.  Anywhere you have a full-text index or an object where multiple attributes are indexed in some kind of database, you've got objects with more than one name.

"You have to be consistent and unambiguous."  As I mentioned on the first point, there are lots of ways to be slightly ambiguous at a human level.  You can refer to the same thing by different names, or, with mutable binding, you can refer to the different things with the same name.  In some circumstances, you must be precise, but that's because fundamentally, algorithmic thinking requries a certain level of precision, not because of any specific problem with computers.  In fact, there is a word for inconsistency and ambiguity in programming languages: polymorphism.  Any time you invoke an interface rather than a concrete implementation (which is to say any time you do anything in a dynamic language like Python) you are being ambiguous and potentially inconsistent in your program's behavior.

"You only get one way to organize stuff."  This is a pretty weak point, though, given that Terry himself immediately turns around and notices that tagging and other multiply-indexed database systems are becoming popular.  So he gives us two examples of exceptions, but no examples of the rule.  I'm not sure what I could add to that.

"Programmers are obsessed with "meaning"."  On this one, I'm going to agree, except I don't think it's a problem.  In the computational world, we are obsessed with the meaning of data, because if you get the meaning of the inputs wrong, then the meaning of the outputs is wrong too.  For example: if you have a number that represents the total liabilities that your company has accumulated, it's pretty important that you don't ever treat that as your total profit.  At a deeper level, if you have a sequence of bits that represents a floating-point number, it's important to know about its intended meaning, and not treat it as a string of characters, unless what you really want is a string.  "@H=N" is not as useful a concept as "3.1287417411804199" if you are trying to add it to something.  For what it's worth, I have my own, similar take on how we should treat computational objects that have multiple meanings: Imaginary. Even systems like Imaginary and FluidDB depend on a very rigid definition of some simpler concepts, like numbers consistently being numbers and words consistently being words.  In my view, even if we treat the book itself as multifaceted, it's important to know what the data representing the "readable object" part of a book is really "about", and make sure it stays distinct from the data representing the "paperweight" part of the book.  To be fair, FluidDB appears to do this itself — and this terminology is my least-favorite part of FluidDB — by having single-purpose, permission-controlled "objects" just like every other system, but calling them "tags", and re-using the word "objects" to refer instead to what others might call a "UUID" or "central index".  In Imaginary, the system is similar; although the centrality of the FluidDB "object" (in Imaginary's case, the "Thing") is less stark; using FluidDB's terminology, in Imaginary, a "tag" can have a "tag" of its own; in fact, there's nothing but tags ("Items") anywhere.

"Metadata is separated from the data it describes."  This may be true in some systems, but the web is probably the system with the most data in it anywhere, and in that system, metadata is always available as part of the request and the response.  You can put in any headers you want in the response, and there are lots of pieces of metadata (like content-type) which are almost always found along with the data.  In my opinion, the problem is more that we don't have enough of the previous problem.  Web developers haven't been obsessed enough with meaning: there aren't enough useful conventions around the HTTP request/response metadata, and so it's hard to bundle more metadata in with your response and have it faithfully propagated elsewhere.  We don't know what arbitrary headers might mean, because we don't have any way of expressing a schema for them.

Terry says he's going to write more about these problems, and the solutions that FluidDB provides for them.  I'm looking forward to it.  As part of that, I'd really like to see a clear description of how these problems affect me, or someone I know, either as a programmer or as a user.  What do I, or should I, really want to do with some application right now that these five problems are preventing me from doing?

The reason I felt compelled to write about this is that history — and particularly the history of websites like freshmeat and sourceforge — is littered with the corpses of projects which promised to fundamentally change the way we represent data.  A common problem with these projects is that they have expansive denunciations of current techniques to represent data, or manage persistence, and claim to provide an advance so significant that they will displace all current applications.  What most of the people working on these projects don't realize is that the current techniques for representing data have a history, and there are good reasons for their limitations.  Granted, not all of those reasons are currently relevant, and many are examples of path dependence, but it's still important to understand the reasons in order to escape the problems.

In FluidDB's case, I think that the problem isn't so much that Terry doesn't have the historical perspective, but that he assumes that we all do.  And that we can all make the cognitive leap to see why FluidDB is necessary.  But if I can't do it, I have to assume there are at least a few other programmers who aren't getting the message either.

Thursday, September 24, 2009

Diesel: A Case Study In That Thing I Just Said

Thanks to jamwt for the shout-out on the announcement of Diesel.

Since the reaction to my reaction to tornado was so good (or at least so ... energetic), I figure I should comment on Diesel as well.  Spoiler alert: my reaction is ... largely similar, but since jamwt has been kind of nice to Twisted in the past, and didn't actually say anything mean this time, I'm somewhat reluctant to have that reaction.  Nevertheless, I swore a solemn oath to tell it like it is, keep it real, and soforth.  So I must.

Once again, I'm happy that event-driven programming is getting some love.  This time, I'm pleased that nobody is saying anything especially snarky or FUD-ish about Twisted.  I do feel like it's a little weird not to mention Twisted, or include some comparisons to Nevow or Orbited, both of which provide different, comprehensive approaches to COMET with Twisted.

(Worth noting: Orbited also originally started out using its own event-driven I/O layer, but switched to Twisted later, because Twisted is "crazy delicious".)

Diesel has many more interesting ideas at the level of async I/O than Tornado did.  I think the generator-based approach for implementing protocols is interesting and deserves some more exploration.  I'm not sold on it for every use-case, and I think the implementation might have some flaws, but it definitely has some advantages.

I'd give jamwt a hard time for not reporting issues and communicating with Twisted more before re-writing the core, but for three issues:
  1. jamwt's been around in the Twisted community for a while.  He's written a bunch of fairly deep Twisted code and he clearly knows what the framework is capable of.
  2. I've spoken with him on a number of occasions, and for all I know I might have discussed this with him.  I don't remember it, but it would be pretty embarrassing to write a big rant about how nobody talks to us only to have him paste some chat log where he explained why he was writing Diesel six months ago, and I said "oh, okay" ;-).
  3. Nobody is calling Twisted names or making vague, unsubstantiated accusations.  You're not obligated to examine Twisted, nor Nevow, nor Orbited, I just feel that you owe us some explanation if you publicly say that you tried it and found it wanting.  The tone on the Diesel announcement, in its one brief mention of Twisted, is "we tried it, but we kinda wanted to do our own thing".  So, good for them, they did their own thing, I hope they had fun.
Now, personally, I'd like to leave it at that, but there is a certain inevitable comparison that I think is going to take place.  Diesel has a nicer web page than Twisted.  They have entwittered ... twitified ... uh ... tweetened ... the project, and we haven't; we just have an old-fashioned "blog".  Diesel is smaller than Twisted, so it's easier to explain, and so the people approaching it will have a better idea of its scope.  This might give the immediate impression that it is a simpler, better, more "modern" replacement for Twisted's I/O layer, and this is not the case.  So I still feel it's important that I set the record straight.

Before I launch into my critique, I should say that I don't want to harsh on Diesel too bad. It's a neat little hack and you should go play with it.  And I feel bad pointing out problems with it, since as I mentioned above, nobody's dumping on Twisted.  So, Diesel fans, please take this in the spirit of a frank code-review, not a complaint about your behavior.

The interesting generator-munging bits could be easily adapted to run on top of Twisted's loop, which, arguably, they should have been in the first place; and the toy "hub" that they've written might be good enough for some simple applications where reliability under load is not a serious concern.  In fact, inlineCallbacks might provide a good deal of what is needed to support Diesel's programming style.  Alternately, Diesel might provide some hints as to how things like inlineCallbacks could be made more efficient.

That said, Diesel's I/O loop sucks.

It's disappointing to see the same mistakes getting made over and over again.  First and foremost: no tests.  Come on, Python community!  You can do better!  Write your damn tests first!

The #1 benefit that a brand-new I/O loop project could have over Twisted is that Twisted was written in the bad old days before everybody knew that TDD was the right way to write programs, so we don't have 100% test coverage.  But, we strive to get closer every day, while every new project decides that they don't need no stinking quality control.

Predictably, as it has no tests, Diesel's I/O layer is full of dead code, inaccurate  documentation, and unhandled errors.  Consider this gem, which I found about 30 seconds into reading the code: KqueueEventHub is documented to be "an epoll-based event hub", and its initializer defines an inner function which is never used.  I'm not going to belabor the point by enumerating all the typo bugs I found, but you may find the output of 'pyflakes diesel' interesting.

Instead of Tornado's inaccurate handling of EINTR, Diesel has no handling of EINTR, as far as I can tell.  It also doesn't handle EPERM, ENOBUFS, EMFILE, or even EAGAIN on accept().  To be fair, it has a catch-all exception handler all the way at the top of the stack, so none of these will cause instant crashes, but they will cause surprising behavior in odd situations (and possibly infinite traceback-spewing loops).

More surprisingly - I had to re-read the code about five times to make sure - it doesn't appear that sockets are ever set to be non-blocking, and EAGAIN is not handled from accept(), recv(), or send().  And yes, this can happen even if your multiplexor says your socket is ready for reading and/or writing.  The conditions are somewhat obscure, but nevertheless they do happen.  So, occasionally, Diesel will hiccup and block until some slow network client manages to send or receive some traffic.  In other words: Diesel is not really async.  It just fakes it convincingly, most of the time.

Once again, there's no way to asynchronously spawn a process, and no way to asynchronously connect a TCP client.  Sure, this looks like an asynchronous connect call, but it's misleading: it blocks on resolving the hostname, and it potentially blocks on the initial SYN/ACK/SYN+ACK exchange.  There's no asynchronous SSL support.  And no, that is not trivial.  Not to mention handling all the crazy errors that spew out of the Windows TCP stack.  And since the loop is implemented to be incompatible with Twisted, it's not obviously trivial to compatibly plug it in and get those features.

Again, I don't want to dump on Diesel here; for what it is, i.e. an experiment in how to idiomatically structure asynchronous applications, it's all right.  For that matter Twisted has its fair share of bugs too, which would be pretty easy to lay out in a similar post; you wouldn't even need to do the research yourself, just go look at our bug tracker.

But both Diesel and Tornado make the mistake of attempting to replace the years of trial-and-error, years of testing discipline, and years of portability and feature work that Twisted has accumulated with a few oversimplified, untested hacks.

What they could have done is contributed any extensions that they needed to Twisted's loop, or modifications to Twisted's packaging that would allow them to get a smaller sliver of Twisted's core to bootstrap, if that's what they needed.

My goal in pointing out all these flaws is not to illustrate any particular point about Diesel, but to reinforce the point I implicitly made in my Tornado post, which is that if you try to write a new mainloop (especially without tests) you will screw it up.  You will most likely screw it up in ways which will only surface later, under mysterious circumstances, when your servers are under load and you are under the gun for a deadline.

Or if I happen to get wind of it and write a blog post about it, of course.  Then you get to cheat a little.

It's not an indictment of Diesel that it screwed this up; everyone screws it up.  I would probably screw it up, if I didn't have Twisted sitting in front of me as a direct reference.  POSIX by itself is unreasonably subtle and difficult, but POSIX, plus the subtle variations in different platforms which implement it, plus the Windows APIs which are almost-but-not-quite-exactly-nothing-like the POSIX APIs, presents an inhuman challenge.

Hopefully Diesel will grow some tests.  Hopefully it will fix, or better yet shed, its somewhat unfortunate I/O hub.  I am hopeful that someone will follow Dustin's excellent lead (perhaps Dustin himself!) and port Diesel's API and generator system over to Twisted's I/O architecture and eliminate all these silly bugs.  Of course, it someone did that, you could use Dustin's tornado port with Diesel.

With the silly bugs from the I/O loop out of the way, the Diesel team can write tests for the more interesting pieces, and fix the bugs which aren't entirely silly :-).

Making Twisted Specific

"pffft. twisted isn't specific."
          — W. Allen Short
The original goal of the Twisted project, as I have been frequently reminded of late, is to create a general, inter-operable mainloop that isn't specific to any particular protocol.  The main loop wasn't a goal in itself, as the point of making it general was to provide an opportunity for all protocols could have serious, production-quality implementations that any Twisted application could have access to.  Twisted itself ships with many different protocol implemenations in furtherance of this goal, in an attempt to get critical mass.

This generality is a great strength.  It means that we've attracted a small crowd of generalists.  We have an excellent development process, ever-increasing quality of both code and documentation, and a wide variety of different protocol implementations and libraries for doing common networking and inter-process communication tasks.  We have recently been lucky to attract a few more excellent developers to help with this.

The one thing we haven't been so lucky about is attracting specifists.  Although we still need more people to make Twisted awesome as a library, our community is getting better and better at doing that.  What we need even more than that is individuals with a very specific, focused interest on just one thing that Twisted does.  Czars, if you will, to push the development of Twisted as a suite of interoperating applications.

Twisted already has within it the seeds of excellent replacements for Apache httpd, OpenSSH, BIND, hybrid ircd, Sendmail, imapd, pop3d, and a few other servers, not to mention clients like Pidgin and the OpenSSH command-line client.  In order to sprout and take root, those seeds each need a dedicated advocate, someone who cares deeply about the experience of a user or administrator who just wants Twisted to perform one particular function and doesn't want to write their own application code to make it do that.

Projects like the ones above - OpenSSH and BIND, for example - have an advantage in becoming useful: they have dedicated people who care deeply about satisfying a particular use-case, and are singularly focused on that case.  Since they only have the one problem to worry about, they can give it a much more direct treatment.

However, given the team of infrastructure programmers already working on Twisted, such a focused individual would have an incredible force multiplier.  Consider the statistics on Conch from our 2003 USENIX paper on Twisted: going just by line count, Conch was 4x easier to write than even J2SSH, which was itself substantially smaller than OpenSSH.  It was 10x easier to write than OpenSSH.  So, with the support of Twisted as infrastructure, one Twisted application programmer can do the work of ten merely mortal ones ;-).

It might seem to those of you looking to write a chat client, DNS server, or whatever open-source giant that you want to do battle with, that Twisted is just a library, and you want to write an application.  But we really want twisted to be a comprehensive suite of applications, we're just stretched too thin already to make it realize that potential.

So please rest assured that we would love to have your help with turning Twisted itself into a worthy competitor for these open-source giants - or, for that matter, if you want to build your own competitor as a layer on top of Twisted (for whatever reason: you love .ini files and we don't, you want a more freewheeling development process, or you want a different shade of green on your web pages) we'd still love to help you out and support that effort by fixing whatever issues you have with Twisted's core or protocols.  There's even a super-project on Launchpad for Twisted-but-not-part-of-Twisted projects.  I invite all you application developers out there to join that group and help us with world domination.

(If all that stuff about being ten times more effective as a programmer wasn't enough for you, how about this?  On the Twisted Matrix Labs map of the post-revolutionary world, I'm pretty sure the Emancipated Territory of New Jersey is still missing an archduke and several viscounts.  I can't make any promises, but if you get in on the ground floor of this thing there's still a chance you could be a ruling member of the Twisted over-government!)

Saturday, September 12, 2009

The Hole At The End Of The Pipe

Matt Campbell, a long-time fan of my ramblings, pointed out a post from John Resig that reads almost like a response to my ideas about the browser as a deployment target, despite the fact that it was written several years ago.

While Mr. Resig isn't adamantly against "language abstractions" - he notes many of their benefits - his counterpoint is summed up in this paragraph:

In the case of these language abstractions you are gaining none of the benefit of learning the JavaScript language. When a leak in the abstraction occurs (and it will occur - just as it's bound to occur in any abstraction) what resources do you have, as a developer, to correct the problem? If you've learned nothing about JavaScript then you stand no chance in trying to repair, or work around, the issue.


This is becoming a popular fallacy in programming language circles; treating Joel Spolsky's "Law of Leaky Abstractions" as if it were an actual law.

Let's examine the metaphor of the "leak".  In plumbing, a leak is a hole in a pipe where water gets out.  Joel has noticed that every pipe has a hole in it, and therefore all pipes are leaky.

But that's not quite accurate.  There's another hole in pipes where water gets out: it's called the "faucet", and without that part, the rest of the pipe is pretty useless.  To say that a pipe whose faucet is turned on is "leaky" is somewhat misleading, just as it's misleading to say that an abstraction that propagates errors in its lower levels is misleading.  Joel's entire original essay is based on a subtle (and, I suspect, intentional) misunderstanding of TCP: the error conditions that result from failures in the lower level, unreliable packet delivery mechanism are not leaks in the abstraction, they are very carefully specified and thoroughly documented.  They are part of the abstraction.  The abstraction of TCP does not try to pretend that connections are never broken, it just provides a unified idea of a "broken connection" that is clearly specified so you don't need to understand the five million ways that packet delivery can go wrong.

Put more simply: there are abstractions which do not leak.  The example that Joel provides is one of them: TCP is a comprehensive abstraction.

Then there are abstractions which really do leak.  Every object-relational mapper that provides a facility where you need to directly execute SQL, for example, is leaking the SQL through the abstraction.  Every web templating framework where you can directly generate strings is leaky: the browser speaks DOM, and if you're generating strings, then bytes are leaking through the abstraction.

But "language abstractions" — or as those of us who are not hip to the new web lingo call them, "compilers" — are generally accepted to be the kind of thing that work well enough that you can trust them.  I don't know the specifics of the current crop of javascript-targeting compilers.  Maybe GWT and Pyjamas have issues that would require some knowledge of JavaScript to use them correctly.  A well-written compiler, one that really lived up to the promise of treating the browser as a deployment target, wouldn't have those kinds of issues though.  Let's turn the wayback machine to 1969 and cast Mr. Resig's argument against the contemporary contender for moving up the abstraction stack:

In the case of UNIX, you are gaining none of the benefit of learning the PDP-11 instruction set. When a bug in the C compiler occurs (and it will occur - just as it's bound to occur in any compiler) what resources do you have, as a developer, to correct the problem? If you've learned nothing about PDP-11 assembler then you stand no chance in trying to repair, or work around, the issue.


So, for those of you who work on UNIX-like operating systems using that fancy "C" machine-code abstraction: how much PDP-11 assembler have you written recently?

Tornado + Twisted

Many kudos to Dustin Sallings, who has already created a branch of Tornado which uses Twisted for both networking and HTTP parsing, in probably less time than it took me to write my previous post about how somebody should do that.  Awesome!

(The method it uses is currently a little weird, where you create a "Site" object, but it looks like it would be pretty simple to use a Resource instead if you were so inclined.)

Friday, September 11, 2009

What I Wish Tornado Were

FriendFeed has released its web server, Tornado.  It seems like everyone's blogging about it, and it's obviously relevant to my interests, so I feel like I should say something.

Let me start with the good stuff.  First of all, I think it's great that we have yet another asynchronous contender in the Python world.  Every time something like this comes out, it means that Twisted has to fight that much less hard to get over the huge hump of event-driven programming being too hard, or too weird, or whatever.  It's good to have an endorsement of the general message "if you need a web server to handle COMET requests, it needs to be asynchronous to perform acceptably" from such a high-profile company as Facebook.

Unfortunately I think the larger picture here is a failure of communication in the open source community.  In the course of developing Tornado, there are several things that FriendFeed could have done to move the Twisted community forward, at no cost to themselves.  I don't want to rag on FriendFeed, or Bret Taylor, or Facebook here; they're not the first to re-write something without communicating.  In fact I recently had almost this exact same discussion with another project that did the same thing.  Since Tornado is such a high-profile example, though, I want to draw attention to the problem so that there's some hope that maybe the next project won't forget to communicate first.

My main point here is that if you're about to undergo a re-write of a major project because it didn't meet some requirements that you had, please tell the project that you are rewriting what you are doing.  In the best case scenario, someone involved with that project will say, "Oh, you've misunderstood the documentation, actually it does do that".  In the worst case, you go ahead with your rewrite anyway, but there is some hope that you might be able to cooperate in the future, as the project gradually evolves to meet your requirements.  Somewhere in the middle, you might be able to contribute a few small fixes rather than re-implementing the whole thing and maintaining it yourself.

This is especially important if you are later going to make claims about that project not living up to your vaguely-described requirements, and thereby damage its reputation.  Bret Taylor claims in his blog:

We ended up writing our own web server and framework after looking at existing servers and tools like Twisted because none matched both our performance requirements and our ease-of-use requirements.

First and foremost, it would have been great to hear from Bret when he started off using Twisted about any performance problems or ease-of-use problems.  I'm guessing that Twisted itself had only ease-of-use problems, and other "tools like Twisted" were the ones with performance problems, since later, in a comment on the same post, he says:

I can't imagine there is much of a performance difference [between Twisted Web and Tornado].  The bottom is not that complex in my opinion.

It would also be great if he had explicitly said that Twisted didn't have performance problems rather than making me guess, because I'm sure that is what lots of developers will take away from this.  When you have the bully pulpit, off-the-cuff comments like this can do serious damage to smaller projects.

More to the point, what is the problem with "ease of use", exactly?  The fact that he found Deferred tedious, in particular, seems very strange to me, given that it is so un-tedious that it has become a de-facto standard even in the JavaScript community.  We had no opportunity to help him or anyone else out, because as far as I can tell from searching our archives, we never heard from him or from anyone else at FriendFeed when they were trying out Twisted at first.  Even as he's saying that Twisted is hard to use and (maybe?) performs poorly, he isn't pointing to any particular example of what about it is hard to use, or what performs poorly.  There's still nothing we can do to address this criticism.  And there's still not much we can do to make sure that future potential Twisted users won't have this problem.

Later, in yet another comment, Bret points out the root problem:

... the HTTP/web support in Twisted is very chaotic (see http://twistedmatrix.com/trac/wiki/WebDevelopme... - even they acknowledge this)...

This is true.  However, as I frequently like to note, Twisted is starved for resources.  Reconciling the chaos described on the page about web development with Twisted is an ongoing process.  For a tiny fraction of the effort invested in Tornado, FriendFeed could have worked with us to resolve many of the issues creating that chaos.

This is the main thing I want to reinforce here.  If half a dozen occasional contributors with a real focused interest in web development showed up to help us on Twisted, we'd have an awesome, polished web story within a few months.  If even one person really took responsibility for twisted.web, things would pick up.  But if everyone who wants an asynchronous webserver either uses twisted.web (because it's great!) without talking to us or decides not to use it (because it doesn't meet their unstated requirements) without talking to us, it's going to continue to improve at the same sluggish pace.

Even at the current rate, by the time we have an excellent HTTP story, I somehow doubt that Tornado will have a good SSHv2 protocol story ;-).

In his comment, Bret also takes a couple of pot-shots at Twisted that I think are unnecessary, and I'd like to address those too.

In general, it seems like Twisted is full of demo-quality stuff, but most of the protocols have tons of bugs.

We're not talking about "most" of the protocols here, Tornado is only concerned with HTTP.  And the HTTP implementation(s) in Twisted do not have "tons of bugs".  They are production quality, used on lots of different websites, and have lots of automated tests.  While much of the code in twisted.web doesn't have complete test coverage, since it's old enough to predate our testing requirements, I note that Tornado appears to have zero test coverage.

There's a kernel of truth here — some of the older, less frequently used protocols have a few problems — but in most cases the "bugs" are really just a lack of functionality.  Twisted overall has very few protocol-related bugs, and again, our test policy makes sure that we have get new bugs very rarely.

Given all those factors, it didn't seem to provide a lot of value. Our core I/O loop is actually pretty small and simple, and I think resulted in fewer bugs than would have come up if we had used Twisted.

I must respectfully disagree.  Again, I don't want to rag on FriendFeed here, but here are several features that Tornado would have, and bugs that it wouldn't have, if it used Twisted for the event loop and none of the HTTP stuff:
  1. EINTR wouldn't cause your application to exit if run in a non-US-english locale.
  2. You don't have the opportunity to forget to set a socket to be non-blocking and thereby make your entire application stop.
  3. It would be possible to run your application on Windows.
  4. Firewalled connections and running out of file descriptors wouldn't cause your server to spew errors forever (at least, it won't any more).
  5. You could write a TCP client that didn't block for an arbitrary amount of time in connect().
  6. Finally, of course, you could use all of Twisted's other protocols, client and server: IMAP, POP, SMTP, IRC, AIM, etc.  You could also use external protocol implementations like Thift.
  7. You could spawn asynchronous subprocesses.
and this is a very short list, based on a cursory reading of the source code, not actually running tornado and not a particularly deep audit.  Some of these bugs might not be as serious as I think, and there might be plenty of other bugs.  But I can't really be sure what works for sure, since again: there are no automated tests.

This list is a great example of why projects like Tornado really should use Twisted.  Tornado implements some innovative web-framework stuff, but absolutely nothing interesting that I can see at the level of async I/O.  Using Twisted would have allowed them to focus exclusively on cool web things and left the never-ending stream of incremental surprising platform-specific, only-happens-in-weird-situations bugfixes to a single, common source.

What To Do Now

I hope that someone at FriendFeed will be a little heavier on detail and a little lighter on FUD in some future conversation about Twisted.  However, I'm sure they're going to have their hands full maintaining their own code, so I don't have high expectations in this area.  I'm sure Bret wasn't intentionally slamming Twisted, either; it wasn't like he wrote a big screed about it, he just dropped in a few unsubstantiated comments into a much larger post about Tornado. So I just want to be clear: I don't have sore feelings, I don't need anybody to apologize to me or to Twisted.

If any of you out there are fans of both Tornado and of Twisted, it would be great if you could contribute a patch to Tornado which would allow it to at least optionally use Twisted as an I/O back-end.  It would be great, of course, if lots of people interested in web stuff would help us out with our web situation, but supporting the Twisted event loop would be good regardless. It would mean that when people wanted to speak multiple protocols, they wouldn't need to re-write or kludge in their existing Tornado application, so it would increase the chances that we could get some help with our SSH, FTP, IRC, or XMPP code instead.  It would also open up a much wider multi-protocol landscape to users of Tornado, even if Tornado's default mode of operation still used ioloop.py.

Even better would be to hook up something that made a Tornado IResource implementation, so that Tornado applications and twisted.web and Nevow applications could all be seamlessly integrated into one server.

The whole point of Twisted is to have a common I/O layer that lots of different libraries can use, share, and build on, so that we can solidify the common and highly complex abstraction required of a comprehensive, cross-platform, event-driven I/O layer.  In order to realize that vision, we need help not just with the code; we need more Twisted ambassadors to go out into the community and help us integrate these disparate applications, help us find out where real users are finding the documentation inadequate or the organization confusing.

Tornado could be an excellent opportunity for those ambassadors to go out and introduce others to the wonders of Twisted, because its endorsement from FriendFeed guarantees it an audience of a tens of thousands of developers, at least for its first few months of life.  If you've shied away from contributing to Twisted itself because of our aggressive testing and documentation requirements, well, Tornado apparently doesn't have any, so it would be a great place for you to start :).

The Web, Untangled

In my previous post, I outlined some reasons that web development is worse than other kinds of development (specifically: traditional client-server development).  I left off there saying that I had some prescriptions for the web's ailments, though, and now I'll describe those.  Given that we're stuck with web development for the forseeable future, how can we make it a tolerable experience?

First, let me tell you what the answer isn't.  It isn't a continuation of the traditional "web framework" strategy.  These have been important tools in dredging the conceptual mire of the web for useful patterns, and at this point in history they have a long life ahead of them.  I'm not predicting the death of Django or Rails any time soon.  Django and Rails are the stucco of the web.  An important architectural innovation, to be sure: they let you cover over the materials underneath, allowing you to build structures that are appealing without fundamentally changing the necessarily ugly underpinnings.  But you can't build a skyscraper out of stucco.

As Jacob covered in great detail in his talk, innovations in the "framework" space generally involve building more and more abstractions, creating more and more new concepts to simplify the underlying concepts.  Eventually you run out of brain-space for new concepts, though, and you have to start over.

I started here by saying that we're stuck with the web.  If we can understand why we're stuck with the web, we can make it a pleasant place to be.  Of course everybody has their own ideas about what makes the web great, but it's important to remember that none of that is what makes the web necessary.

What makes the web necessary is very simple: a web browser is a turing complete execution environment, and everyone has one.  It's also got a feature-complete (if highly idiosyncratic) widget set, so you can display text, images, buttons, scrollbars, and menus, and compose your own widgets (sort of).  Most importantly, it executes code without prompting the user, which means the barrier to adoption of new applications is at zero.  Not to mention that, thanks to the huge ecosystem of existing applications, the user is probably already running a web browser.

I feel it's important to emphasize this point.  When developing an application, delivery is king.  It doesn't matter how great your application is if no users ever run it, and given how incredibly cheap in terms of user effort it is to run an application in a web browser, your application has to be really, really awesome to get them to do more work than clicking on a link.  I can't find the article, but I believe Three Rings once did an interview where they explained that some huge percentage of users (if I remember correctly, something like 90%) will leave immediately if you make them click on a "download" link to play the game, but they'll stick around if you can manage to keep it in the browser without making them download a plugin.

Improvements to ECMAScript and HTML sound fun, but if, tomorrow morning, somebody figured out how to securely execute x86 machine code on web browsers, and distribute that capability to every browser on the internet, developers would start using that almost immediately.  HTML-based applications would slowly die out, as their UIs would be comparatively slow, clunky, and limited.

Tools like the Google Web Toolkit (and Pyjamas, its Python clone), recognized this fact early on.  They treat the browser as what the browser should be: a dumb run-time.  A deployment target, not a development environment.  Seen in this light, it's possible to create layers for integration and inter-op above the complexity soup of DOM and JavaScript: despite the fact that the browser itself has no "linker" to speak of, and no direct support for library code, with GWT you get Java's library mechanism.

Although it's not particularly well-maintained, PyPy also has a JavaScript back-end, which allows you to run a restricted subset of Python ("RPython") in a web browser; I hope that in the future this will be expanded to give us a more realistic, full-featured Python VM in the browser than Pyjamas' fairly simplistic translation currently does.  In opposition to the "worrying trend" that Jacob noted, with individual applications needing to write new, custom run-times, they leverage an existing language ecosystem rather than inventing something new.

Using tools like these, you can write code in the same language client-side and server-side.  This simplifies testing.  You can at least get basic test coverage in one pass, in one runtime, even if some of that code will actually run in a different runtime later.  It simplifies implementation and maintenance, too.  You can write functions and then decide to run them somewhere else based on deployment, security, or performance concerns without necessarily rewriting them from scratch.

If toolkits like these gained more traction, it would go a long way towards interop, too.  It would be a lot easier to have an FFI between Python-in-the-browser and Java-in-the-browser than to try to wrangle every possible JavaScript hack in the book.  Similarly on the server side: once a few frameworks can standardize on rich client-server communication channels, it will be easier to have a high-level abstraction over those than over the mess of XmlHttpRequest and its various work-alikes.

There's still an important component missing, though.  Web applications almost always have 3 tiers.  I've already discussed what should happen on the first tier, the browser.  And, as GWT, NaCl and Pyjamas indicate, there are folks already hard at work on that.  The middle tier is basically already okay; server-side frameworks allow you to work with "business logic" in a fairly sane manner.  What about the database tier?

The most common complaint about the database tier is security.  Since half the time your middle tier needs to be generating strings of SQL to send to the database, there are a plethora of situations where an accidental side-channel is created, allowing users to directly access the database.

This is a much more tractable problem than the front-end problem.  For one thing, a really well-written framework, one which doesn't encourage you to execute SQL directly, can comprehensively deal with the security issue.  Similarly, a good ORM will allow you complete access to the useful features of your database without forcing you to write code in two different programming languages.

Still, there's a huge amount of wasted effort on the database side of things.  Pretty much every major database system has a sophisticated permission system that nobody really uses.  If you want to write stored procedures, triggers, or constraints in a language like Python, it is at worst impossible and at best completely non-standard and very confusing.  Finally, if you want to test anything... you're not entirely on your own, but it's going to be quite a bit harder than testing your middle-tier code.

One part of the solution to this problem comes, oddly enough, from Microsoft: LINQ, the Langauge Integrated Query component, provides a standard syntax and runtime model for queries executed in multiple different languages.  More than providing a nice wrapper over database queries, it allows you to use the same query over in-memory objects with no "database engine" to speak of.  So you can write and test your LINQ code in such a way that you don't need to talk to a database.  When you hook it up to a database, your application code doesn't even really need to know.

The other part of the solution comes from SQLite.  Right now, managing the deployment of and connection to a database is a hugely complex problem.  You have to install the software, write some config files, initialize your database, grant permissions to your application user, somehow get credentials from the database to the application, connect from the application to the database, and verify that the database's schema is the same as what the application expects.  And that's before you can even do anything!  Once you're up and running, you need to manage upgrades, schedule downtime for updating the database software (independently of upgrading the application software).  Note that the database can't be a complete solution for the application's persistence needs, either, because in order to tell the application where it needs to find the rest of its data, you need, at the very least, a hostname, username, and password for the database server.

All of this makes testing more difficult - with all those manual steps, how can you really know if your production configuration is the same as your test configuration?  It also makes development more difficult: if automatically spinning up a new database instance is hard, then you end up with a slightly-nonstandard manual database setup for each developer.  With SQLite, you can just say "database, please!" from your application code, specifying all the interesting configuration right there.

Finally, SQLite allows you to very easily write stored procedures and triggers in your "native" language.  You also don't need to quite as much, because your application can much more easily completely control access to its database, but if you want to work in the relational model it's fairly simple.  The stored procedures are just in memory, and are called like regular functions, not in an obscure embedded database environment.

In other words, for modern web applications, a database engine is really just a library.  The easier it is to treat it like one, the easier it is to deploy and manage your application.

In the framework of the future, I believe you'll be able to write UI code in Python, model code in Python, and data-manipulation code in Python.  When you need to round a number to two digits, you'll just round it off, and it'll come out right.

Thursday, September 10, 2009

Oh <what> a.tangled {web, we} WEAVE FROM

The always entertaining Jacob Kaplan-Moss recently posted a missive, "Snakes on the Web", which, if you haven't already read it, is a highly edifying trip through a variety of Python web technologies and history.  He begins with a simple statement — "Web development sucks." — and goes on to ask a number of interesting questions about that.

What sucks about web development?  How will we fix it?  How has python fixed it, and how will python fix it in the future?  While I can't say I agree with every answer, I found myself nodding quite a bit, and he has something useful to say on just about every point.

I noticed one very important question he leaves out of the mix, though, which seems more fundamental than the others: why does web development suck?  In particular, why do so many people who are familiar with multiple styles of development feel like developing for the web is particularly painful by comparison, while so much of software development moves to the web?  And, why does web development in Python suck, despite the fact that otherwise, Python mostly rocks?

Programming for the web lacks an important component, one that Fred Brooks identified as crucial for all software as early as 1975: conceptual integrity.  Put more simply, it is difficult to make sense of "web" programs.  They're difficult to read, difficult to write and difficult to modify, because none of the pieces fits together in a way which can be understood using a simple conceptual model.

Rather than approach this head on, from the perspective of a working web programmer, let's start earlier than that.  Let's say someone approached you with a simple programming task: write an accounting system that includes point-of-sale software to run a small business.  Now, considering some imagined requirements for such a system, how many languages would you recommend that it be written in?

Most working programmers would usually say "one" without a second thought.  A too-clever-by-half language nerd might instead answer "two, a general-purpose programming language for most things and a domain specific language to describe accounting rules and promotions for the business".  Why this number?  Simply put, there's no reason to use more, and introducing additional languages means mastering additional skills and becoming familiar with additional quirks, all of which add to initial development time and maintenance overhead.  Modern programming languages are powerful enough to perform lots of different types of tasks, and are portable across both different computer architectures and different operating systems, so other concerns rarely intrude.

But, in the practical, working programmer's world, what's the web's answer to this question?  Six.  You have to learn six languages to work on the web:
  1. HTML.  This isn't really a programming language, but in web development you do end up reading and writing quite a lot of it.
  2. CSS.  In order to apply visual styles to your HTML so that it actually looks nice in a browser, you need to understand a different language (with a different conceptual model for how documents are laid out than the HTML itself).
  3. JavaScript.  In today's competitive AJAX-y world, you need to be able to react instantly in the browser, writing a real client application.
  4. SQL, so that you can store your data in a database.
  5. Your "middle-tier" language: in my case and Jacob's, that would be Python.  This is where people tend to spend the bulk of their programming time, but not all of it.
  6. A templating language; in Jacob's case, the Django template language.
If you're unlucky, you might need to learn XML, more than one back-end language, a deployment language (UNIX shell scripting or Windows's "batch" language), and ActionScript.  You'll probably need to learn a smattering of some awful web-server configuration language though, like the not-quite-XML-not-quite-HTML used to configure Apache.

Of course, Jacob lists a pile of related technologies too, and rightly points out that it's a lot to keep in your head.  But he is talking about a problem of needing extensive technical knowledge, something which all programmers working in a particular technology ecosystem learn sooner or later.  I'm talking about a different, more fundamental problem: in addition to the surface problem of being complex and often broken, these technologies are fundamentally conceptually incompatible, which leads to a whole host of other problems.  Furthermore, the only component which is really complete is the "middle-tier" language, although bespoke web-only languages like PHP and Arc manage to screw that up too.

Here are a few simple example problems that are made depressingly complex by the impedence mismatch between two of these components, but which are incredibly easy using a different paradigm.

How do you place two boxes with text in them side-by-side?  Using a GUI toolkit, like my favorite PyGTK, it often goes something like this:
left = Label("some text")
right = Label("some other text")
box = HBox()
box.add(left)
box.add(right)
The conceptual model here is simple: the HBox() is a container, the "left" and "right" things are widgets, which are in that container.  You can add them, remove them, swap them, or handle events on them easily.  You can discover how these things are done by reading the API references for the appropriate classes of object.  However, there's no right answer to this question on the web.  You can use a <table> tag, and then some <tr>s and <td>s to make a single-row table with two cells, but that has a variety of limitations; plus, it's considered somehow gauche by most web designers to use tables for layout these days.  Or, you could cook up a collection of CSS classes.  So there's the first impedence mismatch: do you do layout in HTML, or CSS?  Of course most design gurus would like to tell you that "always and only CSS" is the right answer here, but more practically-minded web developers who actually write code will often prefer HTML, partially because it's simpler but partially because CSS's featureset is incomplete and there are some things you can still only do with HTML, or only do portably with HTML.

Plus, how do you discover how these layouts work?  There are a variety of reference materials, but no canonical guide that says "this is exactly what a <table> tag should do, and how it should look".  There are different forms of documentation for both.

If you have a variable number of elements, you quickly run into another problem.  Should this be the responsibility of the HTML, the CSS, or some code (in the templating layer) that emits some HTML or some CSS?  Should the code in the templating layer be written as an invocation of your middle-tier language, or should the template language itself have some code in it?  Reasonable people of good conscience disagee with each other in every possible way over every one of these details.

This is all part of a very complex problem though.  For all of these crazy hoops you have to jump through, HTML and CSS do provide a layout model that allows you to do some very pretty and very flexible things with layout, especially if you have large amounts of text.  Perhaps not as good as even the most basic pre-press layout engine, but still better than the built-in stuff that most GUI toolkits allow you.  So there is an argument that this complexity is a trade-off, where you get functionality in exchange for the confusion.  So let's look at a much simpler problem.

Let's say that, in our hypothetical accounting application, you have a list of items in a retail transaction, and you want to process the list and produce a sum.  Where is the right place to do that?  It turns out you have to write the code to do that three times.

First, you have to write it in JavaScript.  After all, the numbers are all already in the client / browser, and you want to update the page instantaneously, not wait for some potentially heavily-loaded server to get back to you each time the user presses a keystroke.  And why not?  You've got plenty of processing power available on the client.

Then you have to write it in Python.  That's where the real brain of the application lives, after all, and if you're going to do something like send a job to a receipt printer or email a customer or sales representative some information in response to a sale, the number has to be located in the middle tier.

Finally you have to do it in SQL.  Since this is a traditional web application, your Python code is going to be spread out among multiple servers, and the database is the ultimate arbiter of recorded truth.  So you need to have transactions around the appropriate points and execute any interesting aggregate functions (such as SUM()) in the database tier.

So, you've got three times as much work to do in your fancy new web application as you would in a simple record-based application with a GUI.  A worthy price to pay to run in the brave new world of tomorrow rather than on some crusty old client/server system, right?

Well, as it turns out, the problem is somewhat deeper than that.  It turns out that JavaScript, Python, and SQL actually have slightly different numerical models (in fact Python implements at least 4 itself: fixed-point decimal, floating-point decimal, IEEE 754 floating-point binary, and integer math; you should really only use decimal for money, but this isn't availble in JavaScript and its availability in SQL is spotty).  After applying some discounts, your register might read $19.74 but your receipt will read $19.75; and the reports sent to the accounting department will read $19.74898989898989.

Even if you know a lot about math on computers, the limitations of each of these runtimes, and you happen to get all of that just right, you still have another problem to contend with: what happens when somebody else needs to change the logic in question?  How do you test that the Python, the JavaScript, and the SQL are all still in sync?  It's possible, but you have to go above and beyond the usual discipline of test-driven development, because you need to have integration tests that verify that different, almost unrelated code, in different languages, in different environments is all executing properly in lock-step.  Just getting the code from SQL and JavaScript to run in your Python test suite at all is a major challenge; in a language like PHP it's borderline impossible.

This is all even worse when it comes to security, because every part of the application exposes an attack surface, and because you can't use the same language or the same libraries to do any of the work, they all expose a different attack surface.

In his talk, Jacob notes that "frameworks suck at inter-op", but the problem is much deeper than that.  As I've shown here, a single page from a single application written using a single framework, which has only one task to do, can't even inter-operate with itself cleanly, at least not at the level that Jacob wants — or that I want.  He says, "gateways aren't APIs", and he's right: the correct way to inter-operate is through well-defined APIs.  APIs can be discovered through a single, consistent process.  Their implementations can be debugged using a single set of development tools.

CSS isn't an API.  HTML isn't an API.  Strings containing a hodgepodge of SQL and data aren't an API either.

It's not all doom and gloom, but my ideas for a future solution to this problem will have to wait for another post.

Monday, September 07, 2009

Threat 2: Attacks via E-Mail

Continuing my series on simple threat models for internet users, I'll now address the second threat I mentioned: threats via e-mail.

There are two kinds of e-mail attacks: direct attacks, and trojan horses.  First let's talk about direct attacks.

The basic idea behind a direct e-mail attack is that the program you use to read your e-mail might have flaws in it, which a specially-crafted message will exploit.  That message will have a program in it, and a mistake by the programmers who wrote your e-mail client will cause that program to be executed.

Unlike attacks from the outside, which you can very simply protect against by denying outside attackers access to your computer entirely, there is no fool-proof method to protect against this kind of threat.  E-mail formats are highly complex, and messages can contain multiple parts, including images, etc.  The code that decodes images is notoriously prone to security problems.  Even e-mail programs which don't process images are occasionally prone to security problems dealing with the structure of certain messages.

Chances are that you are going to want to read e-mail somewhere, and you probably want to be able to see images and download attachments; shutting off e-mail completely isn't really an option.  The more general advice I gave against the first threat still applies, though: keep all your software up to date, including your e-mail client.  People who make e-mail software take these kinds of threats very seriously and release updates very quickly when problems are discovered.

One way you can mitigate this risk, and reduce the amount of work required to keep up to date (and therefore the opportunity for you to forget to do so) is to use a web-based e-mail client like GMail.  If you use GMail, the potentially vulnerable program running on your computer is just your web browser, and you already need to keep your browser up to date for other reasons.  The code which deals with the structure of messages is all run on the server, and constantly kept up to date by the fine folks at Google.  Similarly, they take steps to protect your browser; stripping out harmful attachments and filtering spam for you so that potentially dangerous messages never reach you.

The much more common form of e-mail attack is easier to defend against, but is attacking something more potentially vulnerable than your e-mail software: it's attacking you.  A trojan horse is a program which doesn't do anything tricky to get itself run automatically, but instead elicits your cooperation in making it run.  Whether you run a web-based email client or the oldest, buggiest version of Microsoft Outlook, you are equally vulnerable to these kinds of attacks.

The key to defending yourself against a trojan horse is to understand what you are double-clicking on.  Look inside that trojan horse before you open it; there may be a bunch of armed greeks inside.  Before you open any document or run any program that was attached to an e-mail, very carefully read the message that it came from.  Ask yourself a few questions:
  1. Were you expecting this message?
    If you weren't expecting the message, you should double-check to make sure.  In the best case, use some mechanism other than e-mail to check.  Give the sender a phone call.  Ask if they actually sent you the message in question.
  2. Is the message really from who it says it's from?
    It's very easy to fake e-mail addresses, so if you are used to receiving messages from Bob Dobbs and you see "From: Bob Dobbs <bobdobbs@example.com>", you shouldn't necessarily believe it.  Does the text of the message read like Bob wrote it?  Does Bob usually send you these kinds of attachments?  Is the "To" line correct?  Does he use your real name?  A lot of spam which includes viruses is very generic, but it is increasingly cleverly disguised as coming from people in your addressbook.
  3. Is an attachment trying to disguise itself?
    Sometimes, even messages you are expecting, from people that you know, will contain evil attachments.  If Bob's computer is infected with a virus, he may well have actually legitimately written you the message but a trojan horse packed itself along for the ride.  In this case, you need to see if the attachment is trying to look like something different than it is.  Does the file's name have multiple extensions?  For example, "business-plan.doc" is a Word document, but "business-plan.doc.exe" is an executable program, with its name changed to pretend to be a Word document to fool you.
  4. Is anything trying to warn you?
    Most browsers and operating systems these days will double-check with you before opening executables which you've downloaded.  If a box pops up saying "Are you sure you want to do that?", don't just click past it immediately; read it completely and try to understand what it's telling you.  Even if you don't understand a word, pausing for a moment to reflect on whether the warning is serious or not will often help you realize that something might be amiss.
If you're careful and look for details which seem out of place, you don't need to be an expert to spot e-mails that look wrong.  The most basic task here is to recognize genuine human communication, and not to scan for any particular technical trick.  That's not all, of course; as I mentioned, there are ways that programs can hijack legitimate communications, but these are much more sophisticated, and much rarer than the much more common type of message, which is one that simply says "hey buddy, click this" and expects you to click on it without thinking.  If you can recognize those you will be safe 99% of the time.

In using the Internet, this is a generally useful skill, and particularly important when it comes to security.  It will be particularly useful when I discuss threat #4, phishing attacks.


Friday, July 10, 2009

Goodbye, Divmod. Hello, World!

At the end of this month, Divmod will lay off its last employee and cease to be.

As some of you know, I've been on hiatus for several months now.  The idea was originally that I would take a break, allow the company to build up a small operating buffer to deal with our cash-flow issues, and heal a psyche damaged by many months of intense stress (caused largely by those same cash-flow issues).

The psyche-healing worked out okay.  I'm feeling much better than I was when my break started.  The cash-flow issues, not so much.  The reality turned out to be that much of the new consulting business we were counting on just didn't materialize.  We managed to get quite a bit of maintenance done on our infrastructure — I continued to help out intermittently, interleaving some reviews and bugfixes with hobby projects — but it was no longer really clear what business purpose that infrastructure was serving.  We didn't have any product that generated a revenue stream and we certainly didn't have the resources to build a new one.

Users of Divmod email: I'm not exactly sure what the plan is, but JP and I will personally make sure that you can get your email in some form and we'll work out some way to keep at least a forwarding service running.

Users of Divmod open source projects: we will figure out some way to continue to host and maintain the code.  I'm not sure what we're going to do about official stewardship, but it was years before Twisted needed any official legal structure, so I'm sure we'll make due.

The Divmod Fan Club, which deposits money into my personal paypal account rather than a business one (for stupid technical reasons which are now extremely convenient), is generating enough money that we may be able to afford some hosting, assuming those of you who supported Divmod-the-company would like to continue supporting Divmod-the-ambiguously-defined-collection-of-open-source-projects.  Regardless of whether you decide to cancel your subscriptions now (you can do so in the UI for your PayPal account; nothing to do with us, happily), thank you all, very much.  You enabled us to do a lot more with our open-source work than we would otherwise have been able to, and you helped the get through a number of crunches in the past.

The fan club might enable us to host the collection of open source projects, and possibly also host versions of Mantissa and Quotient, and Sine.  I think that having some users would help keep those projects alive in the absence of a corporate sponsor.  I'm not really sure what's going to happen to Blendix, though, and as a proprietary thing it requires more thinking.  If you care deeply about it, please get in touch with me.  Also, if you are a member of the Divmod community who might like to help out with administration, we might need help with mundane things like keeping our Trac instance running.

Now, on to the more personal stuff.

Thanks in advance for your condolances, but I'm feeling okay about this.  Not to say that I don't wish Divmod had ended with more success, but I spoke to Amir and JP yesterday, and we all agreed — it's time to move on.  We tried everything we could think of.  It's time to do something different.

More importantly, I'm not really sure what I'm going to do next.

Right now I'm considering a few things.  I have a couple of job offers, I have a few ideas for new businesses that I might want to start myself.  Some of those ideas are things I would bootstrap myself, some would require funding.

Some of you reading this right now have intimated that you'd like to offer me a job, if I were available.  Some have speculated that you might want to fund some other company that was less ambitious than Divmod.  Well, now's your chance.  Get in touch, and let's talk.

If you can, please do it soon, though.  Some of the offers I'm already considering need a decision soon, but I'd really like an opportunity to consider my options before I jump into the next thing.


Thursday, July 09, 2009

Threat 1: Attacks From The Outside

This article continues my series on my personal threat model for the internet.  In this article, I'm going to talk about the threat of automated attacks coming in to your computer over the internet, while it is connected to the internet.

The basic problem underlying this threat is the same as that underlying threats #2 (malicious e-mail messages which attack your e-mail program) and #3 (malicious web pages which attack your web browser): the software you are running on your computer, which you need to do your job, play your games, or otherwise get value out of your computer, is full of bugs.  Some of those bugs are security problems.  The most dangerous type of security problem is one that allows some data which a program is reading, which is supposed to just be processed by the program, to overwrite portions of that program's memory such that it takes over the program.  That data is then itself a program, and can take over your computer.  Unfortunately, this type of problem is very common.

The first thing you need to do to protect against these threats is to regularly install security updates for your computer.  On Windows you can do this by using Automatic Updates, on MacOS X it will be done for you by Software Update, and on Ubuntu, Update Manager.

When updates are available, make sure to install them as soon as you can!  By the time an update is available, the problem that the update is intended to fix has often been made public already.  The publication of the problem allows the update to be created in the first place, but it also allows malicious individuals to create attacks from it.  The longer you wait, the longer you are vulnerable to problems which have been made public, and thus can be exploited by the largest population of attackers.

However, even if all of your software is fully up-to-date, it still isn't perfect.  The general strategy for dealing with this type of problem, then, is to make sure that only data from sources you trust will ever be allowed into that software.  This limits your exposure to attacks.

In later posts I'll talk about limiting your exposure to malicious data that you have specifically requested, but right now I'm just going to talk about preventing unsolicited data getting to your computer directly over the internet.  The best way to do this is to get a commodity hardware router, and put it between your computer and the internet.  Devices such as this are made by vendors such as linksys, belkin, buffalo or netgear.

You don't need to get a router with fancy "security" features like an "SPI firewall" or "intrusion detection".  In my opinion these features don't add a lot - in fact, they will often cause difficult-to-diagnose problems for home users.  Of course, the people who sell these devices love to put the word "security" on the box as many times as possible, but you really only need the most basic security feature, and that's the one that isn't really a "security" feature at all.

The basic feature that a router adds is a separate layer of protection, independent from anything you can do to your computer itself.  If your home computer is hooked up directly to the internet, it looks like this:



That is, whenever your computer tries to contact another computer on the internet, it sends a request directly via your modem.  Whenever another computer tries to connect to you, it goes directly to your computer.  This means that if there are programs that you don't know about, which your operating system vendor, or some application has left running on your computer, anyone on the internet will be able to access them.

If those programs were all perfectly secure, that would be fine.  Unfortunately, programmers make mistakes, and mistakes lead to bugs, and bugs sometimes lead to security problems.

When you have a router, the picture looks more like this:



which is to say, when your computer submits a request to another computer on the internet, the router sees that the request is coming from inside the network, and transparently forwards it to the outside, establishing a channel of communication.  However, when another computer tries to talk to the IP address that your ISP gives you, the device they find is the router.  The router itself is a very simple device, and, unless you've done something unusual to it, will never be running any programs beyond the ones necessary to move traffic between you and your network.  Because one of the functions of a router is to allow multiple computers on your home network, when connections come in from the internet, the router doesn't know which computer it should go to, even if you only have one.  So the incoming connection will be refused, never having a chance to get to your computer.

This is preferable to running "firewall" software on your computer, for two reasons:
  1. Firewall software is still running on your computer, and thus on your operating system.  If your operating system itself has a flaw in it, the firewall can't protect you.
  2. Software which listens for incoming connections is doing so for a reason.  Different components of the same program will sometimes communicate with each other over a network connection internal to the same computer - as a user of those programs, you really shouldn't need to know this.  Firewall software will present you with prompts to allow or deny permission for programs: these prompts often boil down to "do you want this to work?"  If you say yes, your computer will be exposed to a potential threat, if you say no, the program will break.
Of course, if you've prevented other people's computers from accessing yours, there are some programs which will now be unable to connect to your computer.  BitTorrent, for example, is notorious for performing poorly if other users can't connect to you directly.  Certain voice-over-IP programs will also have problems.  To address this, you can add rules to your router to allow specific incoming connections, without opening the floodgates to everything.  This is referred to as "port forwarding", and portforward.com is a good resource.  If installing a router causes any problems with network applications that you use, consult their documentation: port-forwarding issues are usually prominently covered early on.

Friday, June 26, 2009

My Threat Model

As a "computer guy", I am sometimes called upon by friends and family to opine on what makes a computer or a network secure.  Many of my colleagues are in the same situation.  As a "networking guy", I get similar questions from even from experienced "computer guys".

Users have very peculiar ideas about security.  Users — and I include myself in this grouping — will become confused even in areas of the computing experience where billions of dollars have been spent trying to make the experience as easy and comprehensible as possible.  So it stands to reason that users will often be confused in the area of security, by its nature the least usable and comprehensible area of computing.  Attacks are arcane, and, by definition, unexpected ways that software can be manipulated.  Yet, these attacks are very relevant to users, who want to understand what, exactly, they are vulnerable to and how to defend against it.

It's basically impossible to try to understand computer security this way, let alone explain it.

The important thing to remember in any security situation is this: what do you have of value, and what is the threat to it?  Computer security professionals call the answer to this question the "threat model".  Stephen Colbert calls it the ThreatDown.  No matter what you call it, it's important to enumerate the threats that you're defending against.  Any security measure that you take which is not designed to protect you from a threat which you can, at the very least, imagine and describe, is just extra cost.

In my case, people ask me about three broad classes of user:
  1. users who have networked computers in a home, and use them for checking email, browsing the web, online shopping, and games,
  2. users who have networked desktop computers in a business, and use them for email, web, and business applications, and
  3. users who have networked server computers that are running server applications.
These users all have roughly similar threat models, so I'm going to lump them together for the sake of simplicity, with a nod to a few specific situations.

I believe there are five major types of attacks which threaten average users on the internet today.
  1. Automated attacks that attempt to connect to your computer and exploit a flaw in its operating system or in software that is running a server, and install malicious software on your computer.
  2. E-mail attacks, which attempt to deliver a message which will exploit a flaw in your desktop e-mail client to install malicious software on your computer.
  3. Browser attacks, which attempt to get your browser (either with or without your consent) to visit a site which will exploit a flaw in your browser software to install malicious software on your computer.
  4. Phishing attacks, which attempt to convince you to disclose information about yourself, such as bank account numbers, passwords, or personal details that can be used to access those other things.
  5. Snooping attacks, which attempt to read information in transit between you and another computer.  Usually snooping attacks read passwords in an attempt to allow the attacker to impersonate you later.
Attacks 1-3 are all based on the same premise: software is flawed, and sometimes the flaws in it can be exploited to get it to do things that it should not do.  There are multiple resources under threat here: your computer itself (i.e. its processing power), your network connection, and the data stored on your computer.

Attacks 4 and 5 are in a different class.  They're attempting to get you to reveal information over the network, either with or without your knowledge.  The resource under threat here is the information you are transmitting - in most cases, the information being sought is a token which allows you access to some resource; anything from a username and password to your facebook account (which allows for stealing your personal information or impersonating you) to a debit card number (which allows attackers access to the money in your bank account).

I have fairly simple ways to protect yourself against each of these types of attack.  In a series of follow-up articles, I'll cover each of those strategies.  They should cover a wide variety of attacks with a minimum of effort and cost.  Of course, these defenses aren't perfect.  It's possible that someone who knows much more about security than I do will correct me, but if so, that's so much the better.

More importantly, I will try to provide simple abstractions that allow you to reason about each type of attack without understanding the intricacies of the technology involved.  A major reason I've decided to try to write about this is that security vendors play upon the intuitive (and wrong) understanding that most people have about computer security: equating it with physical security, making their security widget the digital "lock" for the digital "house" of your computer.

I am targeting this series at a fairly nontechnical audience.  I realize that my audience here mostly rates pretty high on the nerd spectrum; my hope is that you will agree with what I say sufficiently that this will be a useful resource for you to refer your less technical friends and family.  To maintain your interest, however, I'll also be embedding some details about the reasoning behind my own security practices.  See you next time!

Update: I accidentally posted a draft of this rather than a final copy; some of the sentences and paragraphs were incomplete.  I hope that I've now corrected this.

Tuesday, June 23, 2009

A Chicken in Every Pot and a Python on Every Port

Twisted Matrix Labs is bent on world domination.  We spend so much time working at the level of fine-grained minutæ that we sometimes forget the overarching plan.  So here's a step back: what is Twisted for?

Most people know at least part of Twisted's origin story.  I was working on a text-based game, and I wanted a networking layer, and discovered that there was really nothing available.  I decided to write something general to base the game's networking core on, so I would be able to use production-quality protocols rather than toy "just for this game" stuff.

However, it wasn't just about the game.  That's a good thing, too, because the game has been falling behind quite a bit.  My game was just one example of code that you might want to write that could talk to a network, and my frustration was that despite large amounts of code being written to talk to networks, very little of it was directly usable by other code, and even less of it could be combined.  A major culprit here is that most networking software is written in C, where there is a stark contrast between "application" and "library"; a conscious, deliberate effort has to be made to expose functionality as a library, both in the code and in the build process.

Then there's the security situation.  A 2007 analysis of different types of vulnerability reports that buffer overflows were only recently overtaken by web application attacks, but are still the #2 for vulnerabilities overall, and #1 for OS vendor advisories.  Again, why is everybody still using all this network software written in C?  You can't even have a buffer overflow in most high-level languages.  (The even more depressing thing here is that, as the web development community has moved to higher-level languages, the majority have moved to the worst possible high-level language.  The vulnerability listings for web applications in that same report mostly have to do with flaws in PHP.)

So, the goal of Twisted is to provide a high-quality, high-level, secure implementation of every protocol spoken on the Internet.  We've achieved a lot, but there's still a long way to go.  Netcraft no longer seems to have any data on Twisted, because it is too low in the "other" category.  There's no site I'm aware of that does server market-share for DNS servers, but I'm betting that Twisted remains low in this category as well.

I believe Twisted remains popular in a growing segment of the network applications market, that is to say, applications that don't fit neatly into a single protocol.  If you want to control DNS, HTTP, SIP, and XMPP from a single program, it's far easier in Twisted than in anything else.  However, I think we can do better.   I want Twisted to take on BIND, Apache, Asterisk and jabberd directly as a server in its own right, not an integration mechanism or library.

One major area where Twisted is lacking is in focused, purpose-specific developers.  Apache has lots of people who are only interested in HTTP, Asterisk has people who are only interested in SIP, and libpurple has lots of people who are only interested in chat.  Twisted, by contrast, has excellent generalists, but few individuals to focus on the individual details of a single application.  I'm not sure how to recruit people who have that kind of monomaniacal focus to maintain individual components.  I think it's the details that such people would notice which is holding us back from being more competitive in the general server "market", such as it is.

This is a chicken-and-egg problem.  People interested in chat clients will often find libpurple before they find Twisted Words; people interested in web servers will often find Apache before they find Twisted Web.  Part of this is the lack of relevant conveniences and features, but probably an even bigger part is just our lack of a coherent web presence for those interest groups.  While I think that a lot of people looking for these things would be delighted to find something as easy to script and re-shape as Twisted is, they don't start out by looking for an omni-server platform.

So go update the web site, and take over the world!


Tuesday, June 16, 2009

Why Phones Lost

This morning I was reading Antonio Rodriguez's "Path Dependence And Smartphones" post, where he muses about different perspectives on the "smartphone" market; in particular the European / American divide outlined by Tomi Ahonen in "A Tale Of Two Smartphones".  Antonio tends to think a few years ahead of his time, so I'm always interested in his take on trends like this.  It seems that he's very cautiously optimistic that the "user customized mobile computer" thing is an important trend, but he also notes that Mr. Ahonen believes that the "[smartphone] operating system and any applications had ZERO bearing on the decision [of which phone to buy]. Not for mass market consumers"; and maybe we're looking at this the wrong way.

As I started thinking about my own reactions to this, I realized: I've heard this tune before.  Remember when pundits used to talk about "convergence" between television and computers?  Since the advent of the computer, futurists have been predicting the dawn of a strange new device: part computer, part television, part telephone, part vacuum cleaner.  What would it look like?

Well, a few months ago, I feel like Paul Graham answered that question pretty definitively.  For years, we've wondered what you would get if you mixed computers and televisions.  In Mr. Graham's words: "We now know the answer: computers."

As a child of the digital age — I've been using computers with keyboards, mice, color displays, and networking almost as long as I've been able to read — I always found this conclusion somewhat obvious.  A few of the early computers that I had the opportunity to use, an Atari 800 and an Amiga 1000, both used televisions as monitors, so I have always thought of a television as an output device — you could plug it into a VCR, a computer, or a cable box, but fundamentally it was just a bag of pixels.

I remember the exact moment that it dawned on me that computers were going to take over from TV: I was 14 years old, playing Myst for the first time, and monkeying with the configuration of system extensions that were loaded on my computer in order to squeeze the last few ounces of performance so that the video clips in the game would play smoothly.  I remember thinking, "This is just a problem with RAM and CPU.  In a few years computers will have so much of both that you'll be able to play full screen video without even turning off any extensions."

I, uh, had a pretty limited idea of how optimization worked at the time (the video was still jerky even after I turned off all my extensions), but I am frequently reminded of this insight when I am watching YouTube movies on my LCD "television".  That television, by the way, is just a monitor for a computer that runs Ubuntu so I can watch Hulu and YouTube.  I think maybe I have cable bundled with my internet service, because it's cheaper that way but I've never plugged it in to anything.

I didn't realized how powerful articulating this particular idea is until recently though, because I didn't realize just how much money is spent protecting obsolete infrastructure from the relentless onslaught of microprocessor technology.  Phone companies — which, increasingly, are combination cable/phone/internet companies — are stuck between a rock and a hard place.  As Internet service providers, they are a facilitator of the transition, and make a huge amount of money selling network services to people to make their computers more useful.  But, as cable companies, they want people to think that television is some special, extra expensive thing that needs to be delievered over a different cable.  As phone companies, both wired and wireless, they want people to think that voice and SMS data are special, extra expensive things that need to be delivered via special, magical wireless signals that can't be reduced to the simple and banal "internet".  At the same time, especially as wired phone companies, they want the cost savings that comes from doing all of their networking as plain old IP, with no actual pesky phone circuits to worry about.  Except they still want to sell you the service as if the phone were a different thing from your "internet" connection.  (Whenever I see an ad for Comcast Digital Voice, I can't help but think, "Do you think that's air you're breathing?".)

There's still a lot of speculation in each of these industries that some new, hybridized technology is going to create a special and unique relationship with the consumer.  But that's one thing Mr. Ahonen got right: the consumer doesn't care about your "operating system".  They don't care about your "applications".  They just care what they can do with their technology, and they care how much it costs to do so.  The thing is, computers do more, and cost less, than any other specialized, dedicated technology.  If your industry is fighting computers in the hopes of holding on to some residual value, you are going to lose.  Here's a simple formula:

Computer + X = Computer

Consider a few specific examples: the convergence of computers with television has resulted in three general categories of technology: YouTube (and other flash video sites, such as Hulu), Tivo (and other DVRs), and digital cable boxes with on-demand technology.  YouTube is a program you run on a computer to watch videos.  A Tivo is a computer (running linux) that is running a program to let you watch and record videos.  And those cable boxes are computers (running some crappy cut-down embedded OS) that let you watch videos on the cable company's terms.  Whether or not your customers care about choice, all these things are computers because it's fundamentally cheaper and easier for the vendors to produce these things out of commodity PC components rather than specialized "media" electronics.

But Mr. Graham neatly outlined that trend already, so let's move on to other industries.  What happens when you add a computer to an accounting ledger?  You get a computer program (like BusinessMind, or QuickBooks) which lets you do accounting on your computer.  Computers and books?  The Kindle, which is a hand-held computer that lets you read books.  If you look a bit deeper, you'll find that the Kindle is actually a computer program that can run places other than its dedicated device.  Only crafty marketing folks prevent it from being more widely accessible; say, on your desktop or "television".

Let's get to the point of this whole schpiel: phones.  Phones are already computers, pure and simple.  They are just small computers with microphones and speakers, and soon, cameras and screens.  You can look at the exciting developments in the world of phones and see that this is so.  What are the hottest phones of the last few years?  The iPhone, which is a small Macintosh computer, and the G1, which is a small Linux PC.  Microsoft would have you believe that their small Windows PCs are equally relevant, even if they are clearly an also-ran in this category.  (Disclosure: I actually have a Windows Mobile phone, and I'm fairly happy with it, but I'll be glad when I can finally ditch it for Android.)  None of these "phones" does anything interesting in the area of phone-ness.  They don't have particularly awesome voice quality or particularly awesome reception or even particularly awesome voicemail, although the iPhone certainly raised the bar.  They're just better computers than the previous generation of "phones"; computers that can run a wider variety of programs.

However, phones are still computers with weird restrictions, restrictions that are purely a function of the "path dependence" that Antonio mentions, which dragged them out of the muck and the mire of the telecom industry.  SMS is my favorite example of this: 10¢ to send a 140 character message.  How much does a tweet cost on twitter?  How much does an instant message cost on AIM, or Google Talk, or any IRC network you please?  If you were billed at SMS rates to read this post, it would have cost you $10; the cost of a decent paperback.  I know I'm wordy, but I'm not that wordy.  If you were charged at SMS rates for a day's worth of casual web browsing, images and all, you'd probably have to take out a mortgage just to pay for it.  Phone companies have been able to sustain the myth that SMS data is somehow special and deserves to be treated as sacred and precious, fully 1000 times more expensive than the regular bytes you get off the internet, even at the obscene prices they charge for usage-based data plans.

SMS is particularly egregious, but voice isn't that much different.  Phone companies charge such ridiculous rates for "voice" data that Skype built an entire profitable business around giving people the same service for free, and only making money by piggybacking on the phone companies' greed and charging you when sending voice messages over phone networks rather than the internet.  I can't imagine casting the wasteful overhead of legacy phone networks in any sharper relief.

So, we're not there yet, but the market pressure is tremendous to treat data as data, regardless whether it's voice, or SMS, or IM, or "internet" (in other words: everything else, including voice and SMS and IM messages which are sent via different mechanisms).  Until the advent of the recent crop of smartphones, it was difficult and expensive to get an unlimited data plan.  Now, unlimited data plans are the norm, except for "tethering" - using your phone as a proxy for your laptop.  The phone companies are still desperate to convince you that you should pay $60 per month for the privilege of having a USB dongle that you can plug into your laptop rather than just using the mobile IP endpoint — which, by the way, probably aleady has a USB port — that's already in your pocket.

The "mass market" user might not care about operating systems or APIs, but they do understand that a bill with seventeen different break-out metered sections is a bald-faced attempt to rip them off, and a flat-rate or easy to understand pay-as-you-go plan with one number on it is better.

To the extent that phones are not yet interchangeable, unrestricted mobile IP endpoints, it is due to the high barrier to entry to telecom providers, lack of regulation of misleading pricing schemes, and the symbiotic relationship between government and the telecom industry.  However, if one wireless carrier moves to provide simpler billing with more features, the others are forced to follow suit - even more so than cable companies and land-line providers, who can hold their customers hostage via development deals with local governments.  So, this progression is happening, albeit slowly.  For example, when AT&T introduced its iPhone plans, many of the other metered PDA and Blackberry plans, both on AT&T and other providers, began receding from their marketing materials.

Fifteen years ago ... ugh, I feel old.  Let's say ... ten years ago, my computer was barely powerful enough to dedicate all of its processing power to playing one low-resolution movie that took up maybe half the screen.  I was still paying for internet over a phone line with a cap on the number of hours I could use it.  Today, I have real-time two-way video connection to anywhere in the world, 24-7, for a single flat rate.  I own a device that fits in the palm of my hand which contains days worth of continuous music, a library of dozens of books, and connects to the internet.

So, back to that "mass market consumer".  Maybe they don't care about my Python console or IRC chat or SSH access applications, but most "mass market" people do listen to music and read books.  And they're going to care about those features being on their phones, and remaining cheap enough that they can use those features without worrying that they'll go broke if they feel like changing out their playlist.  Also - nobody is really a "mass market" consumer, anyway.  You might not be technical, but maybe you're a golfer, or a swimmer, or a finance nerd.  You want to be able to check the weather on your mobile, or update your latest personal best lap time, or get updates when stocks hit certain price threshholds.  Nobody cares what APIs these apps use, or even whether you call them "apps", but everybody has one extra thing they'd like their mobile to do.

The increasingly ubiquitous, user-customizable, network connected, commodity pocket computer is exactly the technology that is going to deliver that.  It's going to have to become commoditized, which means it's going to be standardized, and secured, which means it's not going to be locked up in carrier notions of what's a "text message" and what's a "voice call" and allow for precise price segregation of every different type of data.

In the future, almost every device will be a computer, albeit with specialized peripherals to assist with performing tasks.  If we're lucky, they will be networked together in standard ways to allow us to control all of them in a consistent and convenient way.

This progression towards computers is good for all of us.  Trust the computer.  The computer is your friend.