A collection of articles, ideas, and rambling from a guy who wrote some software that one time.

Thursday, September 25, 2008

exarkun for president

Jean-Paul Calderone is an amazing hacker.

So, here's the setup.  We're working on this application that runs in the Adobe AIR runtime.  We implemented the AMP protocol for client-server communication.  Superficially, it worked great; limited tests gave us good results.

Then, we started throwing some real data at it.  And it choked.  The runtime would terminate its client socket silently: it would stop delivering data to application code, send a TCP FIN to the server, and not even deliver an event indicating that the socket had gone away.  Nothing.  Nowhere to set a breakpoint, nothing to debug.

The Python implementation of this protocol worked fine; everything got delivered.  The connection was not dropped unless we told it to drop.

I spent all night poring over protocol dumps, trying to figure out what was going wrong.  There were slight differences in where in the data stream it was dying - but for some reason, always cleanly, on a message boundary.

So, I come into the office and I fire up the program and show JP.  We get a tcpdump, and he looks at the output.  He squints at it for a few minutes and says:
"Huh.  It died on message 64.  That's interesting...
Oh wait, that's hex.  What's hex 64 in decimal?  100.
... thoughtful pause ...
Maybe the garbage collector is buggy?"
He was right.  The bug was in the garbage collector.  AIR apparently doesn't think sockets (or other stuff, like animations) are "real" things that should keep strong references to the things they are feeding events to - so sometimes they just get garbage collected and silently stop delivering events.
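
For anyone who hasn't been bitten by this kind of thing, here is a rough Python analogy of the failure mode.  It is only a sketch, not AIR's actual internals: `FakeRuntime` and `FakeSocket` are made-up names, and the assumption is simply that the event source holds its sockets weakly, so once nothing else holds a strong reference, delivery stops without any error.

```python
import gc
import weakref


class FakeSocket:
    """Hypothetical stand-in for a runtime socket; not an AIR API."""

    def __init__(self):
        self.received = []

    def deliver(self, message):
        self.received.append(message)


class FakeRuntime:
    """Hypothetical event source that only holds its sockets weakly."""

    def __init__(self):
        self._sockets = weakref.WeakSet()

    def register(self, sock):
        self._sockets.add(sock)

    def dispatch(self, message):
        # Deliver to every socket that is still alive and report how many
        # there were.  A collected socket simply vanishes from the set.
        live = list(self._sockets)
        for sock in live:
            sock.deliver(message)
        return len(live)


runtime = FakeRuntime()
sock = FakeSocket()
runtime.register(sock)
delivered_before = runtime.dispatch("message 1")

del sock      # the application drops its only strong reference
gc.collect()  # force a collection so the effect is deterministic

# No exception and no "connection lost" event: the message goes nowhere.
delivered_after = runtime.dispatch("message 2")
print(delivered_before, delivered_after)
```

In this toy version, keeping `sock` referenced from some long-lived object is enough to keep the messages flowing.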

5 comments:

manuelg said...

WTF Adobe. Straighten up or you will be left in the dust by Google Chrome and Microsoft Silverlight.

Cory said...

Wha.. what? How do you go from "message 100" to "buggy garbage collector"?

It's not as if I've never made similar leaps of intuition myself, I'm not questioning JP's supreme hackerdom. But I don't see this one at all, even after the fact. :-)

tdavis said...

I vote Glyph for VP, then!

glyph said...

Cory - the leap was, "what's likely to 'do something' on every 100 discrete events?" Since the runtime in question is pretty clearly event-driven, and (in practice) each message generates one and only one discrete event, there aren't many choices: it had to be something lower-level than ActionScript, because otherwise we would have gotten some messages delivered. "100" is a pretty arbitrary number, so it would have to be a subsystem that selected a "reasonable lower bound" with no specific technical constraint, i.e. it wasn't a power of 2.

Thanks for making me feel a little smart for at least understanding how he figured it out after the fact ;).

glyph said...

Oops. Not "some messages delivered". I mean, some events delivered, i.e. a connection-lost notification event.