Data In, Garbage Out

Wednesday June 04, 2008
"The string is a stark data structure and everywhere it is passed there is much duplication of process. It is a perfect vehicle for hiding information." —Alan Perlis
I switched to Blogger recently expecting a more "professional" blogging experience.  I thought I'd be able to use a GUI editor and not concern myself with the details of the blog engine.  Apparently I was wrong.

Writing that last post, I had some pretty serious problems with getting the formatting to come out right.  Blogger does a couple of really terrible things:
  • When you switch between "Compose" and "Edit HTML" views, some amount of whitespace (although not all of it) is destroyed.
  • Even when posting using the Atom API, the posted HTML is mangled in semi-arbitrary ways.
    • Properly-quoted "<" and ">" (i.e. "&lt;" and "&gt;") are quoted again.
    • Additional line-breaks are added.
    • &nbsp; is converted to whitespace, and then
    • whitespace is collapsed.
This is one of the reasons that I'm such a stickler for treating data as structured data, and not making arbitrary heuristic guesses about it.  It's not just a matter of handling obscure, nerdy edge cases that average users won't run into.  In fact, it's the opposite.  Nerds (like myself) can figure out whether you're double-quoting your HTML entities or doing improper whitespace conversions.  But what does a regular Joe do when a "frustrated" smiley (">.<") gets converted into some incomprehensible soup of HTML?
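To make the double-quoting failure concrete, here is a minimal sketch using Python's standard `html` module.  It is not Blogger's actual code, just an illustration of what happens when text that has already been escaped is escaped a second time; the variable names are made up for the example.

```python
from html import escape, unescape

smiley = ">.<"

once = escape(smiley)   # correct: escape exactly once, at output time
twice = escape(once)    # the bug: escaping text that is already escaped

print(once)   # &gt;.&lt;
print(twice)  # &amp;gt;.&amp;lt;

# A browser undoes exactly one level of escaping when it renders the page.
print(unescape(once))   # >.<
print(unescape(twice))  # &gt;.&lt;
```

The second pair of lines is what the reader sees: the singly-escaped smiley renders as intended, while the doubly-escaped one renders as entity soup.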

I was reminded of this same issue when reading a page on the Habari wiki:

"If you are going to produce real XHTML in a tool usable by ordinary users, then you cannot do it by string concatenation. You need to assemble your content by serializing an XML DOM tree.

If you want to allow plugins, then your plugin API cannot allow plugin authors to stick arbitrary strings in the output. Rather, they should be allowed to add nodes to the DOM tree, or to manipulate existing ones."

Strangely enough, this page concludes that the important thing is not building their next-generation blogging tool on top of a technology that lets them produce valid output (serializing DOM trees).  The important thing, apparently, is string concatenation, even at the cost of valid output.  They very clearly put an implementation technique above a good experience for users.

(This is your brain.  This is your brain on PHP.  Any questions?)

I don't want to pick on the Habari developers overmuch.  After all, the problem that inspired this post was with Blogger, and WordPress has the same issue.  In fact, the Habari guys are mostly notable for having considered the implications of their decision so carefully; it's just a surprise to me that they walked all the way up to the right answer, looked at it, made sure it was right, and then decided to ignore it and keep on going.

Here's the surprise for the Habari developers, and basically everyone else who writes web applications that process HTML: this has nothing to do with XHTML.  It is a general principle of software development.  The only reason you notice when you're doing XHTML is that the browser isn't correcting for hundreds of minor mistakes.  And rather than screwing up immediately, string concatenation screws up one time in a thousand, when a user manages to type a "<" or a "&".
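A small sketch of the difference, using Python's standard `xml.etree.ElementTree` rather than anything Habari-specific.  The sample comment text and the `is_well_formed` helper are hypothetical, chosen only to show the one-in-a-thousand input that breaks concatenation.

```python
import xml.etree.ElementTree as ET

comment = "I <3 tea & biscuits"

# String concatenation: the user's text is spliced into the markup verbatim.
concatenated = "<p>" + comment + "</p>"

# Tree building: the text is stored as data and escaped on serialization.
paragraph = ET.Element("p")
paragraph.text = comment
serialized = ET.tostring(paragraph, encoding="unicode")


def is_well_formed(markup):
    """Return True if the markup parses as XML (hypothetical helper)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(concatenated))  # False
print(serialized)                    # <p>I &lt;3 tea &amp; biscuits</p>
print(is_well_formed(serialized))    # True

# The round trip returns exactly what the user typed.
print(ET.fromstring(serialized).text == comment)  # True
```

Nothing in the tree-building version had to remember to escape anything; the serializer does it in one place, which is the whole point.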

You know what else you can't build with string concatenation?  AVIs.  PNGs.  SWFs.  Lots of data on the web is treated as structured, but only because it's too hard for the people who generally build web applications to generate it.  If you want to write a program that takes input, processes it, and returns output, you need an intermediary structure to hold that data so that you can ensure its validity.

That's not to say that it's always a bad idea to have user interfaces that allow people to type in a syntax that they know and understand, like an "HTML" view.  Those interfaces might even be forgiving and correct for lots of errors.  Adding line-breaks so that people can type newlines in a mishmash of pseudo-HTML is okay, as long as you know where that ends and your actual structured content begins.  For example, if you include a WYSIWYG GUI editor, you should probably internally make sure that WYS really is WYG and you're not making the same kind of heuristic guesses about the data that your own tool generated as some stuff that a user with only a smattering of HTML knowledge typed in directly.
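One way to picture that boundary, as a rough sketch only: forgiving, user-typed text is interpreted exactly once, on the way in, and everything after that is a tree.  The function name `paragraphs_from_plain_text` and the rules it applies (blank line means new paragraph, single newline means line break) are assumptions made up for this illustration, not how any particular blog engine works.

```python
import xml.etree.ElementTree as ET


def paragraphs_from_plain_text(text):
    """Convert forgiving, user-typed plain text into a list of <p> elements.

    Blank lines separate paragraphs; single newlines become <br /> elements.
    """
    paragraphs = []
    for block in text.split("\n\n"):
        lines = [line for line in block.split("\n") if line.strip()]
        if not lines:
            continue
        p = ET.Element("p")
        p.text = lines[0]
        for line in lines[1:]:
            # Text following a <br /> lives in that element's tail.
            br = ET.SubElement(p, "br")
            br.tail = line
        paragraphs.append(p)
    return paragraphs


typed = "x < y\nand y > z\n\nSecond paragraph & done"

body = ET.Element("div")
body.extend(paragraphs_from_plain_text(typed))
print(ET.tostring(body, encoding="unicode"))
# <div><p>x &lt; y<br />and y &gt; z</p><p>Second paragraph &amp; done</p></div>
```

The heuristic guessing is confined to one function.  Content that the tool itself generated never needs to pass through it, because it is already a tree.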

Keeping structured data structured is near and dear to my heart in large part because as systems get ultra-large, the different pieces need to be able to talk to each other using clear and unambiguous formats.  These points of integration, the places where system A talks to system B (a blogging system talks to a web browser or a blogging client, for example), are absolutely the most critical pieces to test, test, and test again.  If you have a bug in your system, you can find it and fix it; but if you have a bug which only arises from an interaction between your system and two others, your test environment needs to be 3 times bigger, and the error is at least 3 times harder to catch, because there are 3 pairs of systems that could be miscommunicating.  But it gets worse.  If you're dealing with 4 systems, then your test environment is 4 times bigger, but there are 6 pairs, so the bug is 6 times harder to catch.  And so on.
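The 3 and 6 above match the number of distinct pairs among n systems, n(n-1)/2.  Reading "harder to catch" as "number of pairs" is an interpretation, not a measurement, but the arithmetic is easy to sketch.  The system names here are placeholders (the "feed reader" is an addition for the sake of having a fourth).

```python
from itertools import combinations


def pairwise_interactions(systems):
    """Return every unordered pair of systems that could miscommunicate."""
    return list(combinations(systems, 2))


systems = ["blog engine", "web browser", "blogging client", "feed reader"]

for count in range(2, len(systems) + 1):
    pairs = pairwise_interactions(systems[:count])
    print(count, "systems:", len(pairs), "integration points")
# 2 systems: 1 integration points
# 3 systems: 3 integration points
# 4 systems: 6 integration points
```

The test environment grows linearly with the number of systems while the places a format mismatch can hide grow quadratically.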

Fred Brooks observed that adding more programmers to a project running behind schedule makes it later.  This is because of the additional channels of communication.  Now imagine that one of your developers has a curious speech defect: when he says "lasagna" he actually means "critical bug", and vice versa.  When he hears one, he understands it as the other.  Working alone, this is a harmless eccentricity, but as soon as you put other developers into the mix, strange effects start taking place.  He desperately tries to tell them about the delicious lasagna he had last night, and they can't understand why he's losing sleep over it.  Or, he is sanguine as his fellow engineers tell him about all the Italian food they're eating, while the business is losing millions of dollars.

It's sort of like if every time he said "<" the other developers understood him to mean "&lt;".

If I ever have more than a few hours to work on it, eventually I'll deploy my own blogging platform and I'll know that it can handle HTML correctly.  Until then, though, I've worked out a strategy for posting to Blogger which seems to mostly preserve the formatting that I want to see.  I figure that other Python developers might be interested in this, since I frequently see posts to Blogger which eat indentation.
  1. I use ScribeFire as my HTML editor.  It manages OK, except it doesn't include linebreaks, <p>s or <div>s to separate lines.  So, leave the "Convert Line Breaks" option on in your blog's settings.
  2. In "Settings -> Basic -> Global Settings", disable "show compose mode for all your blogs".  The compose view is destructive, and switching between it and "Edit HTML" will eat whitespace each time you do it; it also seems to sometimes eat bits of formatting when you publish even if it's just on the page.
  3. Edit a post in ScribeFire.  To save drafts, use the "save as note" functionality.  This doesn't publish it as a Blogger draft, but there's no way to get the data into Blogger directly.  You can use the HTML tab as you normally would, to add tags that aren't supported (such as "<pre>").
  4. Switch to the HTML ("<A>") tab in ScribeFire.
    1. select all.
    2. copy.
  5. Click "New Post" in the Blogger web UI.
    1. click in the text field.
    2. paste.
The presence of numerous properly-escaped HTML characters in this post should be an indication that it works.