
Thursday, February 07, 2008

Highlighting buried treasure in Twisted

I've previously blogged about twisted.python.modules, but that post assumes you already know about another API inside Twisted: twisted.python.filepath.  Unfortunately this module is rather under-documented and under-publicized, despite being extremely useful.  Unlike a lot of Twisted, much of the code in twisted.python can be extracted and used by itself, regardless of whether the program in question is networked or even event-driven.  This is especially true of FilePath, which is completely blocking, although sometimes I wish there were at least a non-blocking version of it.

A common sort of filesystem script walks a directory hierarchy rooted at a given path, opens each file it finds, and does something with the contents.  For example, let's write a program that prints out a list of all Python modules (files with a .py extension) in a tree which contain shebang lines.

Here's the script using good old os.path:
import sys
import os

def os_shebangs(pathname):
    for dirpath, dirnames, filenames in os.walk(pathname):
        for filename in filenames:
            fullpath = os.path.join(dirpath, filename)
            if (fullpath.endswith(".py") and
                file(fullpath, "rb").readline().startswith("#!")):
                yield fullpath

def os_show_shebangs(pathname):
    for path in os_shebangs(pathname):
        sys.stdout.write("%s: %s\n" % (
            path,
            file(path, "rb").readline()[2:].strip()))

if __name__ == '__main__':
    os_show_shebangs(sys.argv[1])

Pretty normal-looking Python code; not too much wrong with it.  At 20 lines and 596 characters long, it's not too complex.
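One caveat for anyone running this today: the script is Python 2 code, and the "file" builtin it relies on no longer exists in Python 3.  As a rough sketch only, the same logic might look like this on Python 3.  The bytes literals, the "with" blocks, and the decoding step are adaptations for illustration, not part of the original script.

```python
import os
import sys


def os_shebangs(pathname):
    """Yield the path of every .py file under pathname whose first line starts with '#!'."""
    for dirpath, dirnames, filenames in os.walk(pathname):
        for filename in filenames:
            fullpath = os.path.join(dirpath, filename)
            if not fullpath.endswith(".py"):
                continue
            # Binary mode sidesteps decoding errors, so compare against bytes.
            with open(fullpath, "rb") as f:
                if f.readline().startswith(b"#!"):
                    yield fullpath


def os_show_shebangs(pathname):
    """Print each matching path followed by the interpreter named on its shebang line."""
    for path in os_shebangs(pathname):
        with open(path, "rb") as f:
            # Drop the leading "#!" and surrounding whitespace, then decode for display.
            interpreter = f.readline()[2:].strip().decode("utf-8", "replace")
        sys.stdout.write("%s: %s\n" % (path, interpreter))
```

Calling os_show_shebangs(sys.argv[1]) under the usual __main__ guard gives the same command-line behaviour as the original.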

Now let's have a look at a similarly idiomatic version using FilePath:
import sys
from twisted.python.filepath import FilePath

def shebangs(path):
    for p in path.walk():
        if (p.basename().endswith(".py") and
            p.open().readline().startswith("#!")):
            yield p

def showShebangs(pathobj):
    for path in shebangs(pathobj):
        sys.stdout.write("%s: %s\n" % (
            path.path,
            path.open().readline()[2:].strip()))

if __name__ == '__main__':
    showShebangs(FilePath(sys.argv[1]))

At 18 lines and 471 characters, it's about 20% smaller than the version that uses os.path.  However, a small space savings is hardly the most interesting property of this code.  The advantages over the version that uses os.path:
  • It's easier to test.  You can use a fake FilePath object rather than needing to replace the whole "os" module and the "file" builtin.
  • It's easier to read.  You need fewer names; rather than os, os.path, and builtins, the code talks mainly to one object.
  • It's easier to write.  How many of you honestly remembered that "dirpath, dirnames, filenames" is the order of the tuples yielded from os.walk?
  • It's easier to secure.  If you wanted to allow untrusted users to supply input to the os.path version, you would need to be very, very careful.  What about "/"?  What about ".."?  With FilePath, you simply supply the input to the 'child' method, and...
    >>> from twisted.python.filepath import FilePath
    >>> fp = FilePath(".")
    >>> x = fp.child("okay")
    >>> y = fp.child("..")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "twisted/python/filepath.py", line 308, in child
        raise InsecurePath("%r is not a child of %s" % (newpath, self.path))
    twisted.python.filepath.InsecurePath: '/home' is not a child of /home/glyph
    >>> z = fp.child("hello/world")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "twisted/python/filepath.py", line 305, in child
        raise InsecurePath("%r contains one or more directory separators" % (path,))
    twisted.python.filepath.InsecurePath: 'hello/world' contains one or more directory separators
  • It's easier to extend.  As of revision 22464 of Twisted (i.e. the next release) you can replace twisted.python.filepath.FilePath with twisted.python.zippath.ZipArchive, and this exact same code can operate on zip files.
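To make the testing point from the list above a little more concrete, here is a minimal sketch of what a fake might look like.  FakePath is a hypothetical class invented for this example, not something Twisted provides; it implements only the four things shebangs() touches (walk, basename, open, and the path attribute), and it assumes that walk() yields the starting path before its descendants.  The b"#!" literal is likewise an adaptation so the sketch runs on Python 3.

```python
import io


class FakePath:
    """Hypothetical in-memory stand-in for the parts of FilePath that shebangs() uses."""

    def __init__(self, path, content=None, children=()):
        self.path = path
        self._content = content          # bytes for a file, None for a directory
        self._children = list(children)

    def basename(self):
        # Last segment of the fake path.
        return self.path.rsplit("/", 1)[-1]

    def open(self):
        # Hand back a file-like object over the in-memory content.
        return io.BytesIO(self._content)

    def walk(self):
        # Yield this path first, then everything beneath it (assumed ordering).
        yield self
        for child in self._children:
            for descendant in child.walk():
                yield descendant


def shebangs(path):
    # Same shape as the FilePath version in the post; only the bytes literal differs.
    for p in path.walk():
        if (p.basename().endswith(".py") and
                p.open().readline().startswith(b"#!")):
            yield p


tree = FakePath("/fake", children=[
    FakePath("/fake/tool.py", b"#!/usr/bin/env python\n"),
    FakePath("/fake/lib.py", b"import sys\n"),
    FakePath("/fake/notes.txt", b"#!/bin/sh\n"),
])

found = [p.path for p in shebangs(tree)]
print(found)  # ['/fake/tool.py']
```

Nothing here touches the real filesystem, the os module, or any builtin, which is the whole appeal: the function under test only ever talks to the object it was handed.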
Not only does FilePath provide these benefits, it has very few dependencies.  Even if you don't like Twisted much, you can use twisted.python.filepath by copying only 3 modules into your project (twisted.python.filepath, twisted.python.win32, and twisted.python.runtime) and twiddling the appropriate imports to be relative.  Since FilePath is only one import for your code, and mostly consists of method calls, it will easily work with Twisted's version or your own.  So, share and enjoy!

4 comments:

oubiwann said...

I've been thinking about sharing the goodness that is FilePath for a few months -- now I don't have to :-) Using it has saved me sooo much time...

glyf said...

Submit a patch! :) This sounds straightforward / well specified enough I'll even commit to a review... (However, something that lets you actually modify the iteration would be better than something that took static strings.)

As far as getting into the stdlib - agitate on python-dev. I'll help you do any necessary coding if you can do the legwork to get everyone to agree that it's desirable (as opposed to one of the 30 "OO" filesystem wrappers that people have written for the stdlib, or nothing at all). I don't have the energy for that.

djfroofy said...

I don't have the energy to agitate on python-dev either nor do I have the required diplomacy skills ;) Anyhow, here's the patch:

http://twistedmatrix.com/trac/attachment/ticket/3044/filepath.py.diff#preview

zooko said...

This kind of thing makes me sad.

There are lots of people who could benefit from twisted.python.filepath, and there are comparable packages which twisted could use in order to gain the benefit without the cost of maintaining the package (at least one of which is being considered for inclusion in the Python Standard Library), but it isn't going to happen -- non-Twisted-requiring projects aren't going to benefit from twisted.python.filepath, and Twisted isn't going to benefit from those other packages, because Twisted doesn't use good packaging technology so that it can use other people's code and other people can use its code in an easy, manageable way.

Frankly, suggesting that people could copy a few source files is the kiss of death, for the prospect of that code being re-used by other people.

Twisted is falling behind because of this. Please ponder the postscript to this page:

http://www.kieranholland.com/code/documentation/nevow-stan/

This guy says, as I interpret: Nevow is technically better, but Django is the future because it makes it easy for people to re-use components in isolation. The same could be said of many of the Twisted and Divmod offerings.