Far too many things can stop the BLOB

Local mutable filesystem usage is a scalability problem.

It occurs to me that the lack of a standard, well-supported, memory-efficient interface for BLOBs in multiple programming languages is one of the primary driving factors of the poor scalability characteristics of open source SaaS applications.

Applications like Gitlab, Redmine, Trac, Wordpress, and so on, all need to store potentially large files (“attachments”). Frequently, they elect to store these attachments (at least by default) in a dedicated filesystem directory. This leads to a number of tricky concurrency issues, as the filesystem has different (and divorced) concurrency semantics from the backend database, and resides only on the individual API nodes, rather than in the shared namespace of the attached database.

Some databases do support writing to BLOBs like files. Postgres, SQLite, and Oracle do, although it seems MySQL lags behind in this area (though I’d love to be corrected on this front). But many higher-level API bindings for these databases don’t expose support for BLOBs in an efficient way.
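
For example, psycopg2 exposes Postgres large objects through a file-like interface, which lets an application stream an attachment to and from the database a chunk at a time instead of slurping it into memory all at once. Here is a minimal sketch of what storing an attachment that way might look like; the attachments table, its columns, and the connection parameters are all hypothetical:

# blob_sketch.py
# A sketch using psycopg2's large-object support; the "attachments"
# table and the connection parameters are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=myapp")

# Stream the file into a new large object one chunk at a time, so
# only 64KB is ever held in memory.
lo = conn.lobject(0, "wb")
with open("upload.bin", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 16), b""):
        lo.write(chunk)
lo.close()

# Record the large object's OID alongside the attachment's metadata,
# in the same transaction as the rest of the row.
cur = conn.cursor()
cur.execute("INSERT INTO attachments (name, data_oid) VALUES (%s, %s)",
            ("upload.bin", lo.oid))
conn.commit()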

Directly using the filesystem, as opposed to a backing service, breaks the “expected” scaling behavior of the front-end portion of a web application. Using an object store, like Cloud Files or S3, is a good option to achieve high scalability for public-facing applications, but that creates additional deployment complexity.

So, as both a plea to others and a note to myself: if you’re writing a database-backed application that needs to store some data, please consider making “store it in the database as BLOBs” an option. And if your particular database client library doesn’t support it, consider filing a bug.

Deploying Python Applications with Docker - A Suggestion

A template for deploying Python applications into Docker containers.

Deploying Python applications is much trickier than it should be.

Docker can simplify this, but even with Docker, there are a lot of nuances around how you package your Python application, how you build it, how you pull in your Python and non-Python dependencies, and how you structure your images.

I would like to share with you a strategy that I have developed for deploying Python apps that deals with a number of these issues. I don’t want to claim that this is the only way to deploy Python apps, or even a particularly right way; in the rapidly evolving containerization ecosystem, new techniques pop up every day, and everyone’s application is different. However, I humbly submit that this process is a good default.

Rather than equivocate further about its abstract goodness, here are some properties of the following container construction idiom:

  1. It reduces build times from a naive “sudo setup.py install” by using Python wheels to cache repeatably built binary artifacts.
  2. It reduces container size by separating build containers from run containers.
  3. It is independent of other tooling, and should work fine with whatever configuration management or container orchestration system you want to use.
  4. It uses existing Python tooling of pip and virtualenv, and therefore doesn’t depend heavily on Docker. A lot of the same concepts apply if you have to build or deploy the same Python code into a non-containerized environment. You can also incrementally migrate towards containerization: if your deploy environment is not containerized, you can still build and test your wheels within a container and get the advantages of containerization there, as long as your base image matches the non-containerized environment you’re deploying to. This means you can quickly upgrade your build and test environments without having to upgrade the host environment on finicky continuous integration hosts, such as Jenkins or Buildbot.

To test these instructions, I used Docker 1.5.0 (via boot2docker, but hopefully that is an irrelevant detail). I also used an Ubuntu 14.04 base image (as you can see in the dockerfiles), but the concepts should translate to other base images as well.

In order to show how to deploy a sample application, we’ll need a sample application to deploy; to keep it simple, here’s some “hello world” sample code using Klein:

# deployme/__init__.py
from klein import run, route

@route('/')
def home(request):
    request.setHeader("content-type", "text/plain")
    return 'Hello, world!'

def main():
    run("", 8081)

And an accompanying setup.py:

from setuptools import setup, find_packages

setup (
    name             = "DeployMe",
    version          = "0.1",
    description      = "Example application to be deployed.",
    packages         = find_packages(),
    install_requires = ["twisted>=15.0.0",
                        "klein>=15.0.0",
                        "treq>=15.0.0",
                        "service_identity>=14.0.0"],
    entry_points     = {'console_scripts':
                        ['run-the-app = deployme:main']}
)

Generating certificates is a bit tedious for a simple example like this one, but in a real-life application we are likely to face the deployment issue of native dependencies, so to demonstrate how to deal with that issue, this setup.py depends on the service_identity module, which pulls in cryptography (which depends on OpenSSL) and its dependency cffi (which depends on libffi).

To get started telling Docker what to do, we’ll need a base image that we can use for both build and run images, to ensure that certain things match up, particularly the native libraries being built against. This also speeds up subsequent builds by giving a nice common point for caching.

In this base image, we’ll set up:

  1. a Python runtime (PyPy)
  2. the C libraries we need (the libffi6 and openssl ubuntu packages)
  3. a virtual environment in which to do our building and packaging

# base.docker
FROM ubuntu:trusty

RUN echo "deb http://ppa.launchpad.net/pypy/ppa/ubuntu trusty main" > \
    /etc/apt/sources.list.d/pypy-ppa.list

RUN apt-key adv --keyserver keyserver.ubuntu.com \
                --recv-keys 2862D0785AFACD8C65B23DB0251104D968854915
RUN apt-get update

RUN apt-get install -qyy \
    -o APT::Install-Recommends=false -o APT::Install-Suggests=false \
    python-virtualenv pypy libffi6 openssl

RUN virtualenv -p /usr/bin/pypy /appenv
RUN . /appenv/bin/activate; pip install pip==6.0.8

The apt options APT::Install-Recommends and APT::Install-Suggests are just there to prevent python-virtualenv from pulling in a whole C development toolchain with it; Ubuntu expects that virtualenv and pip are purely development tools, which is why it recommends installing Python development tools alongside them. We’ll get to that stuff in the build container. In the run container, which is also based on this base container, we will just use virtualenv and pip to put the already-built artifacts into the right place.

You might wonder “why bother with a virtualenv if I’m already in a container”? This is belt-and-suspenders isolation, but you can never have too much isolation.

It’s true that in many cases, perhaps even most, simply installing stuff into the system Python with Pip works fine; however, for more elaborate applications, you may end up wanting to invoke a tool provided by your base container that is implemented in Python, but which requires dependencies managed by the host. By putting things into a virtualenv regardless, we keep the things set up by the base image’s package system tidily separated from the things our application is building, which means that there should be no unforeseen interactions, regardless of how complex the application’s usage of Python might be.

Next we need to build the base image, which is accomplished easily enough with a docker command like:

$ docker build -t deployme-base -f base.docker .;

Next, we need a container for building our application and its Python dependencies. The dockerfile for that is as follows:

# build.docker
FROM deployme-base

RUN apt-get install -qy libffi-dev libssl-dev pypy-dev
RUN . /appenv/bin/activate; \
    pip install wheel

ENV WHEELHOUSE=/wheelhouse
ENV PIP_WHEEL_DIR=/wheelhouse
ENV PIP_FIND_LINKS=/wheelhouse

VOLUME /wheelhouse
VOLUME /application

ENTRYPOINT . /appenv/bin/activate; \
           cd /application; \
           pip wheel .

Breaking this down, we first have it pulling from the base image we just built. Then, we install the development libraries and headers for each of the C-level dependencies we have to work with, as well as PyPy’s development toolchain itself. Then, to get ready to build some wheels, we install the wheel package into the virtualenv we set up in the base image. Note that the wheel package is only necessary for building wheels; the functionality to install them is built in to pip.

Note that we then have two volumes: /wheelhouse, where the wheel output should go, and /application, where the application’s distribution (i.e. the directory containing setup.py) should go.

The entrypoint for this image is simply running “pip wheel” with the appropriate virtualenv activated. It runs against whatever is in the /application volume, so we could potentially build wheels for multiple different applications. In this example, I’m using pip wheel ., which builds the current directory, but you may have a requirements.txt which pins all your dependencies, in which case you might want to use pip wheel -r requirements.txt instead.
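
For instance, if you go the requirements.txt route, the entrypoint might look something like this instead; this is just a sketch, assuming a requirements.txt at the top level of the /application volume, which builds the pinned dependencies and then the application itself:

# build.docker (alternative entrypoint, assuming /application/requirements.txt)
ENTRYPOINT . /appenv/bin/activate; \
           cd /application; \
           pip wheel -r requirements.txt; \
           pip wheel .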

At this point, we need to build the builder image, which can be accomplished with:

$ docker build -t deployme-builder -f build.docker .;

This builds a deployme-builder image that we can use to build the wheels for the application. Since producing the wheels is a prerequisite for building the application container itself, you can go ahead and run it now. To do so, we must tell the builder to use the current directory as the application being built (the volume at /application) and to put the wheels into a wheelhouse directory (one called wheelhouse will do):

$ mkdir -p wheelhouse;
$ docker run --rm \
         -v "$(pwd)":/application \
         -v "$(pwd)"/wheelhouse:/wheelhouse \
         deployme-builder;

After running this, if you look in the wheelhouse directory, you should see a bunch of wheels built there, including one for the application being built:

$ ls wheelhouse
DeployMe-0.1-py2-none-any.whl
Twisted-15.0.0-pp27-none-linux_x86_64.whl
Werkzeug-0.10.1-py2-none-any.whl
cffi-0.9.0-py2-none-any.whl
# ...

At last, time to build the application container itself. The setup for that is very short, since most of the work has already been done for us in the production of the wheels:

# run.docker
FROM deployme-base

ADD wheelhouse /wheelhouse
RUN . /appenv/bin/activate; \
    pip install --no-index -f wheelhouse DeployMe

EXPOSE 8081

ENTRYPOINT . /appenv/bin/activate; \
           run-the-app

During build, this dockerfile pulls from our shared base image, then adds the wheelhouse we just produced as a directory at /wheelhouse. The only shell command that needs to run in order to get the wheels installed is pip install TheApplicationYouJustBuilt, with two options: --no-index to tell pip “don’t bother downloading anything from PyPI, everything you need should be right here”, and -f wheelhouse, which tells it where “here” is.

The entrypoint for this one activates the virtualenv and invokes run-the-app, the setuptools entrypoint defined above in setup.py, which should be on the $PATH once that virtualenv is activated.

The application build is very simple, just

$ docker build -t deployme-run -f run.docker .;

to build the image from the dockerfile.

Similarly, running the application is just like any other docker container:

$ docker run --rm -it -p 8081:8081 deployme-run

You can then hit port 8081 on your docker host to load the application.

The command line for docker run here is just an example; I’m passing --rm so that running this example won’t clutter up your container list. Your environment will have its own way to invoke docker run and to get your VOLUMEs and EXPOSEd ports mapped, and discussing how to orchestrate your containers is out of scope for this post; you can pretty much run it however you like. Everything the image needs is built in at this point.

To review (a combined script sketching these steps follows this list):

  1. have a common base container that contains all your non-Python dependencies (C libraries and utilities), and avoid installing development tools there;
  2. use a virtualenv even though you’re in a container, to avoid any surprises from the host Python environment;
  3. have a “build” container that installs the development toolchain, puts wheel into the virtualenv, and runs pip wheel;
  4. run the build container with your application code in a volume as input and a wheelhouse volume as output;
  5. create an application container by starting from the same base image and, once again without installing any dev tools, pip install all the wheels you just built, turning off access to PyPI for that installation so that it goes quickly and deterministically, based only on the wheels you’ve built.
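
Here is that combined script: just a sketch stringing together the exact commands shown earlier (the Github sample project mentioned below ships real “build” and “run” scripts along these lines):

#!/bin/sh -e
# build.sh - a sketch combining the build commands from this post
docker build -t deployme-base -f base.docker .
docker build -t deployme-builder -f build.docker .
mkdir -p wheelhouse
docker run --rm \
    -v "$(pwd)":/application \
    -v "$(pwd)"/wheelhouse:/wheelhouse \
    deployme-builder
docker build -t deployme-run -f run.docker .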

While this sample application uses Twisted, it’s quite possible to apply this same process to just about any Python application you want to run inside Docker.

I’ve put a sample project up on Github which contains all the files referenced here, as well as “build” and “run” shell scripts that combine the necessary docker command lines to go through the full process to build and run this sample app. While it defaults to the PyPy runtime (as most networked Python apps generally should these days, since its performance is so much better than CPython’s), if you have an application with a hard CPython dependency, I’ve also made a branch and pull request on that project for CPython, and you can look at the relatively minor patch required to get it working there as well.

Now that you have a container with an application in it that you might want to deploy, my previous write-up on a quick way to securely push stuff to a production service might be of interest.

(Once again, thanks to my employer, Rackspace, for sponsoring the time for me to write this post. Thanks also to Shawn Ashlee and Jesse Spears for helping me refine these ideas and listening to me rant about them. However, that expression of gratitude should not be taken as any kind of endorsement from any of these parties as to my technical suggestions or opinions here, as they are entirely my own.)

Docker Dev to Prod in Just A Few Easy Steps

Get your app into production right now.

It seems that docker is all the rage these days. Docker has popularized a powerful paradigm for repeatable, isolated deployments of pretty much any application you can run on Linux. There are numerous highly sophisticated orchestration systems which can leverage Docker to deploy applications at massive scale. At the other end of the spectrum, there are quick ways to get started with automated deployment or orchestrated multi-container development environments.

When you're just getting started, this dazzling array of tools can be as bewildering as it is impressive.

A big part of the promise of docker is that you can build your app in a standard format on any computer, anywhere, and then run it. As docker.com puts it:

“... run the same app, unchanged, on laptops, data center VMs, and any cloud ...”

So when I started approaching docker, my first thought was: before I mess around with any of this deployment automation stuff, how do I just get an arbitrary docker container that I've built and tested on my laptop shipped into the cloud?

There are a few documented options that I came across, but they all had drawbacks, and none made the right tradeoff for just starting out:

  1. I could push my image up to the public registry and then pull it down. While this works for me on open source projects, it doesn't really generalize.
  2. I could run my own registry on a server, and push it there. I can either run it plain-text and risk the unfortunate security implications that implies, deal with the administrative hassle of running my own certificate authority and propagating trust out to my deployment node, or spend money on a real TLS certificate. Since I'm just starting out, I don't want to deal with any of these hassles right away.
  3. I could re-run the build on every host where I intend to run the application. This is easy and repeatable, but unfortunately it means that I'm missing part of that great promise of docker - I'm running potentially subtly different images in development, test, and production.

I think I have figured out a fourth option that is super fast to get started with, as well as being reasonably secure.

What I have done is:

  1. run a local registry
  2. build an image locally - testing it until it works as desired
  3. push the image to that registry
  4. use SSH port forwarding to "pull" that image onto a cloud server, from my laptop

Before running the registry, you should set aside a persistent location for the registry's storage. Since I'm using boot2docker, I stuck this in my home directory, like so:

me@laptop$ mkdir -p ~/Documents/Docker/Registry

To run the registry, you need to do this:

me@laptop$ docker pull registry
...
Status: Image is up to date for registry:latest
me@laptop$ docker run --name registry --rm=true -p 5000:5000 \
    -e GUNICORN_OPTS=[--preload] \
    -e STORAGE_PATH=/registry \
    -v "$HOME/Documents/Docker/Registry:/registry" \
    registry

To briefly explain each of these arguments: --name is just there so I can quickly identify this as my registry container in docker ps and the like; --rm=true is there so that I don't create detritus from subsequent runs of this container; -p 5000:5000 exposes the registry to the docker host; -e GUNICORN_OPTS=[--preload] is a workaround for a small bug; -e STORAGE_PATH=/registry tells the registry to look in /registry for its images; and the -v option points /registry at the directory we created above.

It's important to understand that this registry container only needs to be running for the duration of the commands below. Spin it up, push and pull your images, and then you can just shut it down.
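
Shutting it down is just a matter of stopping the container; since it was started with --rm=true, it will clean itself up, and the pushed images will remain safe in the storage directory set aside above:

me@laptop$ docker stop registry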

Next, you want to build your image, tagging it with localhost.localdomain.

me@laptop$ cd MyDockerApp
me@laptop$ docker build -t localhost.localdomain:5000/mydockerapp .

Assuming the image builds without incident, the next step is to send the image to your registry.

me@laptop$ docker push localhost.localdomain:5000/mydockerapp

Once that has completed, it's time to "pull" the image on your cloud machine, which (again, if you're using boot2docker like me) can be done like so:

me@laptop$ ssh -t -R 127.0.0.1:5000:"$(boot2docker ip 2>/dev/null)":5000 \
    mycloudserver.example.com \
    'docker pull localhost.localdomain:5000/mydockerapp'

If you're on Linux and simply running Docker on the local host, then you don't need the "boot2docker" command:

me@laptop$ ssh -t -R 127.0.0.1:5000:127.0.0.1:5000 \
    mycloudserver.example.com \
    'docker pull localhost.localdomain:5000/mydockerapp'

Finally, you can now run this image on your cloud server. You will of course need to decide on appropriate configuration options for your application, such as -p, -v, and -e:

me@laptop$ ssh mycloudserver.example.com \
    'docker run -d --restart=always --name=mydockerapp \
        -p ... -v ... -e ... \
        localhost.localdomain:5000/mydockerapp'

To avoid network round trips, you can even run the previous two steps as a single command:

me@laptop$ ssh -t -R 127.0.0.1:5000:"$(boot2docker ip 2>/dev/null)":5000 \
    mycloudserver.example.com \
    'docker pull localhost.localdomain:5000/mydockerapp && \
     docker run -d --restart=always --name=mydockerapp \
        -p ... -v ... -e ... \
        localhost.localdomain:5000/mydockerapp'

I would not recommend setting up any intense production workloads this way; those orchestration tools I mentioned at the beginning of this article exist for a reason, and if you need to manage a cluster of servers you should probably take the time to learn how to set up and manage one of them.

However, as far as I know, there's also nothing wrong with putting your application into production this way. If you have a simple single-container application, then this is a reasonably robust way to run it: the docker daemon will take care of restarting it if your machine crashes, and running this command again (with a docker rm -f mydockerapp before docker run) will re-deploy it in a clean, reproducible way.
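
Concretely, re-deploying a new build might look like the following sketch, reusing the same hypothetical names from above: push the rebuilt image to the local registry, then pull, remove, and re-run it on the server.

me@laptop$ docker push localhost.localdomain:5000/mydockerapp
me@laptop$ ssh -t -R 127.0.0.1:5000:"$(boot2docker ip 2>/dev/null)":5000 \
    mycloudserver.example.com \
    'docker pull localhost.localdomain:5000/mydockerapp && \
     docker rm -f mydockerapp && \
     docker run -d --restart=always --name=mydockerapp \
        -p ... -v ... -e ... \
        localhost.localdomain:5000/mydockerapp'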

So if you're getting started exploring docker and you're not sure how to get a couple of apps up and running just to give it a spin, hopefully this can set you on your way quickly!

(Many thanks to my employer, Rackspace, for sponsoring the time for me to write this post. Thanks also to Jean-Paul Calderone, Alex Gaynor, and Julian Berman for their thoughtful review. Any errors are surely my own.)