My goal was to show that threads don't scale terribly well in terms of processing concurrent I/O, and in particular, Twisted scales better, even in its slowest, most naive mode than a simple threaded implementation of a similar concept.
Unfortunately I was unable to do this easily, because Debian GNU/Linux (in its default configuration, using a 2.6.7 kernel) is not able to even spawn enough threads to generate a fair test.
My test program was to write a Twisted protocol handler that would take some simple input, perform a trivial computation, generate some output, and close the connection. I then wrote a Twisted client for it, and write a simple threaded reactor which would handle each connection in a separate thread, to compare the amount of time each multiplexor took to finish connecting vs. the amount of time it took to complete the entire test. At around 256 threads, though, I could no longer spawn more threads to handle simultaneous connections, a problem which others have experienced.
This indicates in part the cultural issues surrounding the performance of threads, to wit, nobody really cares enough about good threading performance to even bother making it possible to spawn a lot of them.
This test was carried out on an otherwise quiescent Linux machine, a 2000XP+ Athlon with a gig of RAM, using python 2.3. Without further ado:


I don't expect these graphs to be taken as serious performance measurements - after all, I'm not even providing source code yet. Still, the fact that the threading connection-start performance is so abysmally bad (even compared to
select() - remember, this is using Twisted's default configuration, with no tuning or optimization) and the fact that with an otherwise excellent development machine I don't even have the software to test larger numbers of concurrent connections might give you an idea of why I find it so hard to take this kind of comparison seriously.It has brought to light the fact that although Twisted would be an excellent test platform for different multiplexing mechanisms, nobody has bothered to put together a survey of those yet. I will do a more rigorous benchmark of threads vs. Twisted later this week, release the code, and hopefully some useful work will come of this if it can be used as a simple reactor benchmark.
6 comments:
My reading of the previous benchmark was that he was testing in a CPU-bound situation. At least it seemed like he was doing image processing, which implies significant CPU use. With that kind of load I would expect the performance to be more on par. At least, by the time threads started showing scaling performance problems WRT connections, the application would be hopelessly bogged down anyway. Twisted might degrade better in that situation -- slowly but surely responding to requests -- but it'd be questionable whether you were testing a realistic situation at that point.
In CPU-bound situations it seems reasonable that threads would be faster (putting scaling aside). In that case in Twisted you'd either have to use threads, which can't beat a natively-threaded approach and may introduce additional overhead, or you'll have to refactor your algorithms to work in discrete chunks which is probably even slower. Properly implemented I imagine Twisted should be very close to a natively threaded system (and would probably use threads as well), and maybe that's where the original benchmark was inaccurate. But if you look at that benchmark as a defense of threads, not an attack on Twisted, I think the point is valid regardless of those implementation details.
So... the question is what scaling is more likely? It depends a lot on what you are doing. For an online chat server or a MMORPG the scaling you show here probably applicable. For a typical database-backed web application I think this kind of scaling doesn't mean much; that Apache 1.3 works as well as it does -- with its rather simplistic way of processing connections -- shows that serialized connections can work well. I don't know enough about GUI bottlenecks to say one way or another, though this kind of scaling doesn't really make sense to me in that context. GUIs are a weird middle ground, because async processing is the default, not threading, so you're likely to use threads in very selective ways. Anyway, no client should make that number of connections.
I'd say the most interesting part of this scaling, to me, is how performance degrades in extreme situations... though I don't have a good feel for how you should test that, and at what point you'd really be testing the underlying OS instead of the application.
This isn't really a response to arensito --- it's probably useful as part of the frame for one, though.
The graphs could be more readable if the axes were labeled, especially the first one.
Threads are better for some things, async I/O is better for other things. It's a bit of a grey area really. Can we not leave it at that?
I've been watching this topic over the last week or so with interest, and I think you're doing more harm than good (for yourself and Twisted) by going head-to-head with the guy who published that page.
Although he tried to be objective, he shows signs of being more than a bit annoyed at his treatment in #twisted. You also show a lot of signs of being more than a bit annoyed. In the end, neither of you will win, but both of you will look stupid. Stop and save yourself the bother! :)
(dw here, 'disgruntledme' never got off the ground).
I hope that the benchmarking idea spawned by this discussion becomes something seriously useful, but I am sure that lots of people are tired of hearing about this particular argument, so - sure. After this post I won't mention arensito's benchmarks in particular again.
I ran into this, so I got to look into it in a moderate degree of detail. In case it's helpful, here's what I found out:
NPTL-based systems use the maximum stack size (sh's ulimit -s value) as the default stack thread size, and promptly allocate all of this memory space. NPTL is the normal threading library in recent glibcs and was patched into older versions (of both glibc and the kernel) by at least Red Hat as far back as Red Hat 9 or so. (If the maximum stack size is set to unlimited I believe NPTL defaults to 8M.)
Programs that use NPTL can always specify the thread stack size, but CPython doesn't and so gets the default, which in turn is whatever you've set it to or the system's default has been set to on that particular system and suchlike.
One can manipulate ulimit -s in order to make threads use (much) less stack. If one spawns external programs, one may need to have one's program yoink it back up again so that they don't explode with really small stack sizes. (These days, 64k apparently qualifies as 'really small'.)
Pre-NPTL systems have/had a different default thread stack size scheme with smaller numbers. I don't know what they are; I never had any problems with my code on our RH 7.3 systems, so I never wound up looking into the pthreads and glibc code there.
Ooo. Thanks for this advice. I thought there was something you could do with ulimit that would fix this.
Post a Comment