"The string is a stark data structure and everywhere it is passed there is much duplication of process. It is a perfect vehicle for hiding information." — Alan PerlisI switched to blogger recently expecting a more "professional" blogging experience. I thought I'd be able to use a GUI editor and not concern myself with the details of the blog engine. Apparently I was wrong.
Writing that last post, I had some pretty serious problems with getting the formatting to come out right. Blogger does a couple of really terrible things:
- When you switch between "Compose" and "Edit HTML" views, some amount of whitespace (although not all of it) is destroyed.
- Even when posting using the ATOM API, the posted HTML is mangled in semi-arbitrary ways.
- Properly-quoted "<" and ">" (i.e. "&lt;" and "&gt;") are quoted again.
- Additional line-breaks are added.
- "&nbsp;" is converted to white-space, and then
- white space is collapsed.
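The manglings listed above are easy to reproduce. What follows is a minimal sketch in Python using only the standard library; it is not Blogger's actual code, and the helper names `escape_once` and `collapse_whitespace` are made up for illustration. It shows what happens when already-escaped markup is escaped a second time, and when "&nbsp;" is turned into a space before whitespace is collapsed.

```python
import html
import re


def escape_once(text: str) -> str:
    """Escape '&', '<' and '>' in text (hypothetical helper)."""
    return html.escape(text, quote=False)


def collapse_whitespace(markup: str) -> str:
    """Turn &nbsp; into a plain space, then squeeze runs of spaces and tabs.

    This imitates the lossy behaviour described above; it is an
    illustration, not a recommendation.
    """
    return re.sub(r"[ \t]+", " ", markup.replace("&nbsp;", " "))


raw = "if a < b and b > c:"
stored = escape_once(raw)      # escaped once: this is the correct markup
mangled = escape_once(stored)  # escaped again: the entities are now visible text

print(stored)
print(mangled)
print(repr(collapse_whitespace("&nbsp;&nbsp;&nbsp;&nbsp;return x")))
```

Unescaping `mangled` once gives back `stored`, not `raw`, so a reader sees literal "&lt;" in the post; and the four-space indent comes out as a single space, which is fatal for Python source.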
I was reminded of this same issue when reading a page on the Habari wiki:
"If you are going to produce real XHTML in a tool usable by ordinary users, then you cannot do it by string concatenation. You need to assemble your content by serializing an XML DOM tree.
If you want to allow plugins, then your plugin API cannot allow plugin authors to stick arbitrary strings in the output. Rather, they should be allowed to add nodes to the DOM tree, or to manipulate existing ones."
Strangely enough, this page concludes that they should not build their next-generation blogging tool on top of a technology that lets them produce valid output (serializing DOM trees); the important thing, apparently, is not producing valid output but string concatenation. They very clearly put an implementation technique above a good experience for users.
(This is your brain. This is your brain on PHP. Any questions?)
I don't want to pick on the Habari developers overmuch. After all, the problem that inspired this post was with Blogger, and WordPress has the same issue. In fact, the Habari guys are mostly notable for having considered the implications of their decision so carefully; it's just a surprise to me that they walked all the way up to the right answer, looked at it, made sure it was right, and then decided to ignore it and keep on going.
Here's the surprise for the Habari developers, and basically everyone else who writes web applications that process HTML: it has nothing to do with XHTML. It is a general principle of software development. The only reason you notice when you're doing XHTML is that the browser isn't correcting for hundreds of minor mistakes, and rather than screwing up immediately it screws up one time in a thousand when a user managed to type a "<" or a "&".
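To make the "one time in a thousand" failure concrete, here is a small sketch using Python's standard `xml.etree.ElementTree`. The function names and sample comments are invented for illustration; this is not how Habari, Blogger, or any particular engine renders content. It builds the same paragraph twice, once by concatenation and once by serializing a tree.

```python
import xml.etree.ElementTree as ET


def render_by_concatenation(comment: str) -> str:
    """Glue user text between tags; valid only if the text happens to be."""
    return "<p>" + comment + "</p>"


def render_by_tree(comment: str) -> str:
    """Put user text into a node and let the serializer do the escaping."""
    paragraph = ET.Element("p")
    paragraph.text = comment
    return ET.tostring(paragraph, encoding="unicode")


def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


ordinary = "nice post"        # the 999 cases that work either way
awkward = "x < y && y > z"    # the one case in a thousand

for comment in (ordinary, awkward):
    print(comment,
          is_well_formed(render_by_concatenation(comment)),
          is_well_formed(render_by_tree(comment)))
```

Both approaches agree on the ordinary comment, which is exactly why the concatenating version survives testing. Only the tree version still parses, and round-trips to the original text, once a "<" or "&" shows up.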
You know what else you can't build with string concatenation? AVIs. PNGs. SWFs. Lots of data on the web is treated as structured, but only because it's too hard for the people who generally build web applications to generate it. If you want to write a program that takes input, processes it, and returns output, you need an intermediary structure to hold that data so that you can ensure its validity.
That's not to say that it's always a bad idea to have user interfaces that allow people to type in a syntax that they know and understand, like an "HTML" view. Those interfaces might even be forgiving and correct for lots of errors. Adding line-breaks so that people can type newlines in a mishmash of pseudo-HTML is okay, as long as you know where that ends and your actual structured content begins. For example, if you include a WYSIWYG GUI editor, you should probably internally make sure that WYS really is WYG and you're not making the same kind of heuristic guesses about the data that your own tool generated as some stuff that a user with only a smattering of HTML knowledge typed in directly.
Keeping structured data structured is near and dear to my heart in large part because as systems get ultra large, the different pieces need to be able to talk to each other using clear and unambiguous formats. These points of integration, the places where system A talks to system B (a blogging system talks to a web browser or a blogging client, for example) are absolutely the most critical pieces to test, test, and test again. If you have a bug in your system, you can find it and fix it; but if you have a bug which only arises from an interaction between your system and two others, your test environment needs to be 3 times bigger, and the error is at least 3 times harder to catch. But it gets worse. If you're dealing with 4 systems, then your test environment is 4 times bigger - but the bug is 6 times harder to catch. And so on.
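The figures in the paragraph above (3 systems, 3 times harder; 4 systems, 6 times harder) match the number of distinct pairs among n systems, n(n-1)/2. That reading is an assumption about where the numbers come from, and the system names below are placeholders, but it makes the growth easy to check:

```python
from itertools import combinations


def integration_points(systems):
    """Return every unordered pair of systems that might talk to each other."""
    return list(combinations(systems, 2))


# Hypothetical system names, used only to have something to count.
three = ["blog engine", "web browser", "blogging client"]
four = three + ["feed reader"]

print(len(integration_points(three)))  # 3 pairs
print(len(integration_points(four)))   # 6 pairs
for pair in integration_points(four):
    print(" <-> ".join(pair))
```

The test environment grows linearly with the number of systems, while the number of places where two systems can misunderstand each other grows quadratically.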
Fred Brooks observed that adding more programmers to a project running behind schedule makes it later. This is because of the additional channels of communication. Now imagine that one of your developers has a curious speech defect: when he says "lasagna" he actually means "critical bug", and vice versa. When he hears one, he understands it as the other. Working alone, this is a harmless eccentricity, but as soon as you put other developers into the mix, strange effects start taking place. He desperately tries to tell them about the delicious lasagna he had last night, and they can't understand why he's losing sleep over it. Or, he is sanguine as his fellow engineers tell him about all the Italian food they're eating, while the business is losing millions of dollars.
It's sort of like if every time he said "&lt;" the other developers understood him to mean "<".
If I ever have more than a few hours to work on it, eventually I'll deploy my own blogging platform and I'll know that it can handle HTML correctly. Until then though, I've worked out a strategy for posting to blogger which seems to mostly preserve the formatting that I want to see. I figure that other Python developers might be interested in this, since I frequently see posts to blogger which eat indentation.
- I use ScribeFire as my HTML editor. It manages OK, except it doesn't include linebreaks, <p>s or <div>s to separate lines. So, leave the "Convert Line Breaks" option on in your blog's settings.
- In "Settings -> Basic -> Global Settings", disable "show compose mode for all your blogs". The compose view is destructive, and switching between it and "Edit HTML" will eat whitespace each time you do it; it also seems to sometimes eat bits of formatting when you publish even if it's just on the page.
- Edit a post in ScribeFire. To save drafts, use the "save as note" functionality. This doesn't publish it as a Blogger draft, but there's no way to get the data into Blogger directly. You can use the HTML tab as you normally would, to add tags that aren't supported (such as "<pre>").
- Switch to the HTML ("<A>") tab in ScribeFire.
- select all.
- copy.
- Click "New Post" in the blogger web UI.
- click in the text field.
- paste.
3 comments:
I remember that I started blogging on Blogger and found that I could not even embed any JavaScript in the page, so I purchased web hosting and switched to WordPress.
Never looked back.
Hey glyph, it's Owen, remember Brian Urbanek's buddy from college? I don't know if you stumbled on Habari on your own or if you found it through me via Brian, but finding you writing about it seemed a bit coincidental for me.
I just wanted to agree that it is a bit strange and unfortunate that Habari walked up to the correct answer and then dismissed it. As you say, this is the nature of PHP development. Since the language is structured around string concatenation, most PHP developers are familiar with this paradigm, and input filtering of user submissions is predominantly an act of string concatenation, it seemed wrong to ignore that as a deciding factor.
It would be great to have Habari build output using a DOM, but I think the language isn't really up to the task. The tedium of it would outweigh any benefit. I am under the impression (though I'm not as expert as you in the matter) that Python would have similar issues. That is to say, ideally, the used language would be more suited to producing dom-structured output than simply having functions that give you access to the DOM.
As a poor example, either language would be better if you could just assign raw XHTML into a variable, and then manipulate that variable using hierarchical tree members, like html.body.divname = 'whatever'. Maybe Python does this, but PHP's methods for allowing this are strangely complex for being a language born of the need to produce web output.
Anyway, I think that's what led us to the conclusion that using a DOM to produce valid XHTML is too hard to attempt to implement, which hopefully explains to some degree why we strangely arrived at the conclusion that DOM is better and still didn't use it.
That said, WordPress makes no attempt to be valid or investigate the issues. Perhaps the reader kunxi should "look back".
Hi Owen! Thanks for your comments. I'm glad to find them thoroughly reasonable - I hope to do as well on blogs where people critique Twisted.
Yes, I did find Habari through Brian. It was that introduction thing that originally got me thinking along these lines, but Blogger's brokenness got me pissed-off enough to actually write about it :-).
As far as Python vs. PHP: no, Python doesn't have similar issues. By way of an example, one of the open-source products that Divmod (my company) offers is Nevow, a templating library for Python which produces (hopefully) valid XHTML via DOM manipulation. It uses a technique called template precompilation which makes it (at least theoretically) as fast as string concatenation for building complex DOM structures, and (at least theoretically) you can't create invalid DOM output. It also uses some Python metaprogramming tricks to allow you to inline DOM literals in your code. Although this technique is hackish and templates are generally preferred, it gives you the convenience of a language like PHP for quick hacks without sacrificing correctness.
This problem - encouraging programmers to emit corrupt data - is almost entirely unique to PHP.
Let me say that I think Habari is an interesting and relatively nice piece of software, especially in its category of blogging engines. But let me be clear about my feelings about what it's built on:
PHP is garbage. Utter, unadulterated garbage. The syntax is ambiguous, programs' correctness is dependent upon configuration file options, the object model is broken, and everything that the language strongly encourages you to do is wrong.
Dijkstra said, "It is practically impossible to teach good programming to students that have had a prior exposure to BASIC: as potential programmers they are mentally mutilated beyond hope of regeneration." But this, I believe, was hyperbole. You can learn a lot about good programming from BASIC. You can do structured programming in BASIC. Eventually you outgrow it, and (if you are a diligent programmer who seeks to improve) you will sense that you need to express things that it will not let you.
So, BASIC will not help you very much. PHP, however, actively fights against good programming style. It requires you to use global state everywhere. It gives you a million conveniences for programming techniques that are just wrong; for example, the aforementioned string concatenation, and the "array" type which strongly suggests that you should do everything with giant piles of semi-coherent unstructured data.
Most of all, PHP simultaneously attacks the two pillars of good programming: writing libraries, and paying attention to correctness. PHP doesn't have namespaces, and viciously punishes you for attempting to emulate them with classes or objects. PHP doesn't have a notion of a "library" or a "module" or even C++'s gross "translation unit"; there are just files, and you include them however you want, making it difficult to set up an integrated deployment environment. You can't really be sure if a program is correct in PHP no matter how much analysis you do, because an administrator might always trip a php.ini option that renders all of your code invalid.
And don't get me started on the quality (or lack thereof) of the runtime implementation. Spoiler alert: it's not good.
I could go on... for days, probably; maybe even weeks. And I have limited exposure to PHP. Every time I approach it for another day, I learn another day's worth of terrible things about it.
I am not religious about Python. There are lots of good languages out there. Common Lisp, Scheme, and Ruby are good; certain JavaScript runtimes are passable, and even Java or C# can get the job done, when it comes to web programming. But I do seriously hope that you'll consider porting the ideas about blogging that you've got in Habari to an environment that will be less actively hostile to making it good.