Static On The Wire

Friday July 04, 2008
I am, as you might have guessed, a big fan of dynamic typing.  Yet two prominent systems I've designed, the Axiom object database and the Asynchronous Messaging Protocol (AMP), have required systems for explicit declarations of types: at a glance, static typing.  Have I gone crazy?  Am I pining for my glory days as a Java programmer?  What's wrong with me?

I believe the economics of in-memory and on-the-wire data structures are very, very different.  In-memory structures are cheap to create and cheap to fix if you got their structure wrong.  Static typing to ensure their correctness is wasted effort.  On the other hand, while on-the-wire data structures (data structures which you exchange with other programs) can be equally cheap to create, they can be exponentially more expensive to maintain.

When you have an in-memory data structure, it's remarkably flexible.  It is, almost by definition, going to be thrown away, so you can afford to change how it will be represented in subsequent runs of your program.  So, when your compiler complains at you for getting the static type declarations wrong, it's just wasting your time.  You have to write unit tests anyway, and static typing makes unit testing harder.  What if you want a test that fakes just the method foo on an interface which also requires baz, boz, and qux, so you can quickly test a caller of foo and move on?  A really good static type system will just figure that out for you, but it probably needs to analyze your whole program to do it.  Most "statically typed" languages — such as the ones that actually exist — will force you to write a huge mess of extra code which doesn't actually do anything, just so all your round pegs can pretend to fit into square holes well enough to get your job done.
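
In a dynamically typed language, that fake can be as small as the one method the caller uses.  A minimal sketch, where `FakeFoo` and `call_foo` are hypothetical names invented for illustration:

```python
class FakeFoo:
    """Test double implementing only foo; baz, boz, and qux are deliberately absent."""

    def __init__(self):
        self.calls = []

    def foo(self, value):
        # Record the call so a test can assert on it later.
        self.calls.append(value)
        return "fake result"


def call_foo(provider):
    """Hypothetical code under test: it only ever calls foo on its provider."""
    return provider.foo(42)


fake = FakeFoo()
result = call_foo(fake)
```

Nothing here declares that `FakeFoo` satisfies the wider interface, and nothing needs to: the caller only touches `foo`.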

But I don't have to convince you, dear reader.  I'm sure the audience of this blog is already deeply religious on this issue, and they've got my religion.  I'm just trying to make sure you understand I'm not insane when I get to this next part.

The most important thing that I said about in-memory data structures, above, is that you throw them away.  It's important enough that I'll repeat it a third time, for emphasis: you throw them away.  As it so happens, the inverse is the most important property of an on-the-wire data structure.  You can't throw it away.  You have to live with it.

Forever.

Oh, sure, you told your customers that they all have to upgrade to protocol version 3.5, but they're still using 3.2.  Unless you're Blizzard Entertainment, you can't tell them to download the new version every six weeks or go to hell.  Even if you can do that (and statistically speaking, you probably aren't Blizzard Entertainment) you have to keep the old versions of the updater protocol around so that when version 4.0 comes out all the laggards who haven't even run your program since 3.0 can still manage to upgrade.

Here's the best part: your unit tests aren't going to help you, at least not in the same way they would with your in-memory data.  When you change an in-memory data structure, you aren't supposed to have to change your unit tests.  You want the behavior to stay the same, so you don't change the tests; if they start failing, you know something is wrong.  With your new protocol changes, though, you can have tests for the old protocol and tests for the new protocol, but every time you make a protocol change you need a new test for every version of the protocol which you still support.  Plus, you probably can't stop supporting older versions of the protocol (see above).

If you've got a message X[3], and you're introducing X[4], you have to make sure that X[4] can talk to X[3] and X[2] and X[1].  Each of those is potentially a new test.  Each one is more work.  Even worse, it's possible to introduce X[4] without realizing that you've done it!  If you have a new, optional argument, let's call it "y", to a dynamically-typed protocol, your old tests (which didn't pass y) will pass.  Your new tests (which do pass y, to the newly-modified X[4]) also pass.  But there's a case which has now arisen which your tests did not detect: y could be passed to a client which only supports X[3], and an error occurs.
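
The gap can be sketched with two receivers.  Everything here is an assumption made for illustration: the handler names, the argument `a`, and the choice that the X[3] receiver rejects arguments it doesn't know about.

```python
def handle_x3(box):
    """Hypothetical X[3] receiver: accepts exactly the argument 'a'."""
    unexpected = set(box) - {"a"}
    if unexpected:
        raise ValueError("unknown arguments: %s" % sorted(unexpected))
    return {"result": box["a"] * 2}


def handle_x4(box):
    """Hypothetical X[4] receiver: same as X[3], plus the optional 'y'."""
    unexpected = set(box) - {"a", "y"}
    if unexpected:
        raise ValueError("unknown arguments: %s" % sorted(unexpected))
    return {"result": box["a"] * 2 + box.get("y", 0)}


old_message = {"a": 1}          # what the old tests send
new_message = {"a": 1, "y": 5}  # what the new tests send

# The combinations your test suite already covers all succeed.
old_to_old = handle_x3(old_message)
old_to_new = handle_x4(old_message)
new_to_new = handle_x4(new_message)

# The combination nobody wrote a test for: a new message sent to a peer
# that never upgraded past X[3].
try:
    handle_x3(new_message)
    new_to_old_failed = False
except ValueError:
    new_to_old_failed = True
```

Once `handle_x3` has been edited into `handle_x4` in your repository, the failing combination no longer exists anywhere in the code your tests can see.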

If these were in-memory structures, that case would no longer exist.  There is no version of X currently in your code which cannot accept y.  Your tests ensure that.  You have to time-travel into the past for your unit tests to discover the code which would cause them to fail.  You can't just do it once, either: maybe X[3] was designed to ignore all optional parameters.  You have to consider X[2] and X[1].  You have to travel back to all points in time simultaneously.

This is why I said that the cost is exponential: you carry this cost forward with each new supported version that gets released.  Of course, there are ways to reduce it.  You can design your protocol such that arguments which your implementation doesn't understand are ignored.  You can start adding version numbers to everything, or change the name of every message every time some part of its schema changes.  All of these alternatives get tedious after a while.

So what does this have to do with static typing?  Static type declarations can save you a lot of this work.  For one thing, it becomes impossible to forget you're changing the protocol.  Did you change the data's types?  If so, you need to add a compatibility layer.  These static type declarations give you key information: what do the previous versions of the protocol look like?  More importantly, they give your code key information: is an automatic transformation between these two versions of the data format possible?  (If not, is the manual transformation between these two versions correct?)
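
Here is a rough sketch of what code can do once the types of each version are written down.  The schema dictionaries and helper names are hypothetical, and the rule that "only added fields" means "automatically convertible by dropping the extras" is a policy assumed for the example, not something every protocol can adopt.

```python
# Hypothetical declared schemas for two versions of the same message.
X_V3 = {"a": int}
X_V4 = {"a": int, "y": int}


def schema_changes(old, new):
    """Describe how the declared schema differs between two versions."""
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "retyped": sorted(k for k in shared if old[k] is not new[k]),
    }


def can_downgrade_automatically(old, new):
    """Assumed policy: if fields were only added, a new message can be
    rewritten for an old peer by dropping the fields it doesn't know."""
    changes = schema_changes(old, new)
    return not changes["removed"] and not changes["retyped"]


def downgrade(message, old):
    """Keep only the fields that the old schema declares."""
    return {k: v for k, v in message.items() if k in old}


changes = schema_changes(X_V3, X_V4)
automatic = can_downgrade_automatically(X_V3, X_V4)
for_old_peer = downgrade({"a": 1, "y": 5}, X_V3)
```

A non-empty `changes` is the "did you change the data's types?" question answered mechanically; when `automatic` is false, that is the signal that someone has to write the manual transformation.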

In a dynamically typed program, you can figure out what your in-memory types are doing by running the debugger, inspecting the code that's calling them, and simply reading the code.  Sometimes this can be a bit spread out (in a badly designed system, painfully spread out), but the key point is that all the information you need is right in front of you, in the source code.  If you're working on code that is shipping data elsewhere without an explicit schema, you have to have a full copy of the revision history and some very fancy revision control tools telling you what the protocol looked like in the past.  (Or, perhaps, what the protocol that some other piece of software has developed looked like in the past.)

Your disk is another kind of wire.  This one is particularly brutal, because while you might be able to tell someone to download a new client to be able to access a service, there is no way you are ever going to get away with saying "just delete all your data and start again.  There's a new version of the format."  When writing objects to disk (or to a database), you might not be talking across a network, but you're still talking to a different program: a later version of the one you're writing now.  So these constraints all apply to Axiom just as they do to AMP; more so, actually, because in the case of AMP all the translations can be very simple and ad-hoc, whereas in Axiom the translations between data types need to be specifically implemented as upgraders.
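
To make the idea of an upgrader concrete, here is a generic sketch.  It is not Axiom's actual API: the registry, the `upgrader` decorator, the `person` type, and its fields are all hypothetical, chosen only to show old on-disk data being stepped forward one version at a time.

```python
# Hypothetical registry mapping (type name, old version) -> upgrader function.
UPGRADERS = {}


def upgrader(type_name, old_version):
    """Register a function that converts old_version data to old_version + 1."""
    def register(func):
        UPGRADERS[(type_name, old_version)] = func
        return func
    return register


@upgrader("person", 1)
def person_1_to_2(row):
    # Assumed change: version 2 split the single 'name' field in two.
    first, _, last = row["name"].partition(" ")
    return {"first": first, "last": last}


@upgrader("person", 2)
def person_2_to_3(row):
    # Assumed change: version 3 added an optional email field.
    return dict(row, email=None)


def load(type_name, version, row, current_version):
    """Apply upgraders one step at a time until the row is current."""
    while version < current_version:
        row = UPGRADERS[(type_name, version)](row)
        version += 1
    return row


stored = {"name": "Ada Lovelace"}  # written long ago by version 1
loaded = load("person", 1, stored, current_version=3)
```

The stored version number and the declared shape of each version are what let the program find the right chain of upgraders without anyone consulting the revision history.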

With a network involved, you also have to worry about an additional issue of security.  One way to deal with this is by adding linguistic support to the notion of untrusted code running "somewhere else", but type declarations can provide some benefit as well.  Let's say that you have a function that can be invoked by some networked code:

@myprotocol.expose()
def biggger(number):
    return number * 1000


Seems simple, seems safe enough, right?  'number' is a number taken from the network, and you return a result to the network that is 1000 times bigger.  But... what if 'number' were, instead, a list of 10,000 elements?  Now you've just consumed a huge amount of memory and sent the caller 1000 times as much traffic as they've sent you.  Dynamic typing allows the client side of the network connection to pass in whatever it wants.
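
The amplification is easy to see with the decorator left off and the list scaled down from 10,000 elements to 10 so the example stays small:

```python
def biggger(number):
    return number * 1000


from_an_honest_client = biggger(3)

# A hostile client sends a list instead of a number.  Multiplying a list
# repeats it, so the reply is 1000 times longer than the request.
from_a_hostile_client = biggger([0] * 10)
reply_length = len(from_a_hostile_client)
```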

Now, let's look at a slightly different implementation of that function:

@myprotocol.expose(number=int)
def biggger(number):
    return number * 1000


Now, your protocol code has a critical hint that it needs to make this code secure.  You might spell it differently ("arguments = [('number', Integer())]" comes to mind), but the idea is that the protocol code now knows: if 'number' is not an integer, don't bother to call this function.  You can, of course, add checks to make sure that all the methods you want to call on your arguments are safe, but that can get ugly quickly.
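
One way the protocol side could act on that hint is sketched below.  The `expose` function here is a hypothetical stand-in for `myprotocol.expose`, written only to show the check happening before the application function runs; a real protocol layer would also have to parse the bytes off the wire.

```python
import functools


def expose(**declared):
    """Hypothetical stand-in for myprotocol.expose: refuse to call the
    function unless the received arguments match the declared types."""
    def decorate(func):
        @functools.wraps(func)
        def from_network(**received):
            if set(received) != set(declared):
                raise TypeError("unexpected or missing arguments")
            for name, expected in declared.items():
                # Exact type match, so that bool (a subclass of int)
                # is not silently accepted where int was declared.
                if type(received[name]) is not expected:
                    raise TypeError("%s must be %s" % (name, expected.__name__))
            return func(**received)
        return from_network
    return decorate


@expose(number=int)
def biggger(number):
    return number * 1000


ok = biggger(number=7)

try:
    biggger(number=[0] * 10)
    list_rejected = False
except TypeError:
    list_rejected = True
```

The body of `biggger` is unchanged and contains no defensive checks; the declaration alone is enough for the list to be turned away before any multiplication happens.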

Let's break it down.

Static type declarations have a cost.  You (probably) have to type a bunch of additional names for your types, which makes it difficult to write code quickly.  Therefore it is preferable to avoid that cost.

All the information you need about the code at runtime is present when you're looking at your codebase.  Therefore — although you may find its form more convenient — static type declarations don't provide any additional information about the code as it's running.  However, information about the code on opposite ends of the wire may only be in your repository history, or it may not be in your code at all (it could be in a different codebase entirely).  Therefore static typing provides additional information for the wire but not in memory.

At runtime, you only have to deal with one version of an object at a time.  On the wire, you might need to deal with a few different versions simultaneously in the same process.  Static type declarations provide your application with information it may need to interact with those older versions.

At runtime (at least in today's languages) you aren't worried about security inside your process.  Enforcing type safety at compile time doesn't really add any security, especially with popular VMs like the JVM not bothering to enforce type constraints in the bytecode, only in the compiler.  However, static type declarations can help the protocol implementation understand the expectations of the application code so that it does not get invoked with confusing or potentially dangerous values.  Therefore static type declarations can add security on the wire while they can't add security in memory.  (It turns out that if you care about security in memory, you need to do a bunch of other stuff, unrelated to type safety.  When the rest of the world catches up to the E language I may need to revisit my ideas of how type safety helps here.)

If you have data that's being sent to another program, you probably need static type declarations for that data.  Or you need a lot of memory to store all those lists I'm about to multiply by 1000 on your server.