I’ll talk about some back-end stuff for a change, showing what can kill your server, especially when it comes to external data (a.k.a. I/O), and what solutions exist.
Let’s take this simple piece of PHP code:
<?php
$req = "http://pipes.yahoo.com/pipes/pipe.run?_id=…&_render=php";
$data = unserialize(file_get_contents($req));
It grabs some external content, which might take some time depending on Yahoo! Pipes. This is where the problem starts to itch: when the server handles that page, it does mostly nothing but wait for the external data to arrive.
Using Apache, a process or a thread is spawned that waits on the external data and, more importantly, consumes memory and CPU just by waiting.
Going asynchronous would mean being able to do other stuff, such as answering other requests, while waiting.
Unfortunately, I don’t know of any solution to that problem in PHP. PHP is designed to execute as fast as possible, not to perform heavy operations. Big websites that use PHP generally use it as a pure front-end language (for example, the pre-Google YouTube).
There are a couple of ways to go async. Some come from languages that are asynchronous by design, like JavaScript (with node.js) or Erlang (to name a few). Ruby can achieve this with EventMachine, and Python (which I explored more) has several options (Twisted, Tornado (coming from FriendFeed, bought by Facebook), Eventlet, Gevent, …). Take this dummy WSGI code:
from time import sleep

def application(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    sleep(1)
    return ["Hello, world!"]
The external data you’re waiting on is simulated by a sleep, which makes it easy to understand. You could just as well fetch external data that takes a fixed time to arrive; it amounts to the same thing.
When you run this particular code, a process or thread will be dedicated to the request for one full second, blocking other incoming ones. You’ll need as many workers/threads/processes as concurrent requests to handle them all within that second. Suboptimal, since those processes aren’t doing anything useful while they wait.
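A quick stdlib-only way to see the cost (the 0.1-second sleep, the five requests and the `handle_request` name are arbitrary choices for this sketch): handling five blocking "requests" sequentially takes five times as long as one, and doing them concurrently requires one dedicated thread per request.

```python
import threading
import time

def handle_request():
    # stands in for the sleep / external fetch in the WSGI example above
    time.sleep(0.1)

# Sequential: one worker handles the simulated requests one after another.
start = time.time()
for _ in range(5):
    handle_request()
sequential = time.time() - start   # roughly 0.5 s

# Concurrent: same wall-clock win, but it costs one thread per request.
start = time.time()
threads = [threading.Thread(target=handle_request) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.time() - start     # roughly 0.1 s, with 5 threads alive
```

The threads spend their whole lifetime asleep, which is exactly the waste the asynchronous approaches below try to avoid.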
Take a look at some experiments I ran.
Eventlet and Gevent, the most convenient to use imho, will run each incoming request in a coroutine (a lightweight cooperative thread), enabling the main process to do something else whenever a request is waiting. The only thing to change in the code above is where the sleep function comes from:
from eventlet import sleep
# or:
from gevent import sleep
And, of course, you need to run it with an appropriate server (the one bundled with the library, Gunicorn, or Spawning (Eventlet only)). These libraries monkey-patch the standard library, so no code has to be modified to become asynchronous: using urllib.urlopen to fetch external data, for instance, will no longer block the whole process.
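To make the coroutine idea concrete, here is a stdlib-only sketch of the cooperative scheduling that Eventlet and Gevent do for real (the scheduler loop and the `handle_request` generator are invented for illustration): each "request" is a generator that yields whenever it would block on I/O, and a trivial loop runs the others in the meantime.

```python
from collections import deque

def handle_request(name, results):
    results.append(name + ": started")
    yield                      # pretend we block here, waiting on external I/O
    results.append(name + ": got data")

results = []
ready = deque(handle_request("req%d" % i, results) for i in range(3))
while ready:
    task = ready.popleft()
    try:
        next(task)             # run the task until it yields (or finishes)
        ready.append(task)     # it "blocked": reschedule it for later
    except StopIteration:
        pass                   # task finished, drop it
```

All three requests get started before any of them "receives" its data, in a single process and a single thread; that is the trick the real libraries pull off with their event loops.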
Twisted and Tornado, for example, use a reactor model, which forces you to handle asynchronous code with callbacks. It’s closer to the metal, but the learning curve might give you some headaches. I do love Twisted, but it’s sometimes just too much, really.
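A stdlib-only caricature of that callback style (`fetch_async` and the tiny event loop are invented for illustration, not actual Twisted or Tornado APIs): instead of getting data back from a call, you hand over a function to be invoked once the data arrives.

```python
pending = []  # stands in for the reactor's queue of ready events

def fetch_async(url, callback):
    # a real reactor would fire the callback when the socket is readable;
    # here we just pretend the response arrives on the next loop iteration
    pending.append(lambda: callback("response from " + url))

results = []
fetch_async("http://example.com/a", results.append)
fetch_async("http://example.com/b", results.append)

# the "reactor loop": dispatch each pending event to its callback
for event in pending:
    event()
```

The control flow is inverted: your code never waits, it just registers what should happen next, which is exactly what makes reactor-style code harder to read.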
Python has another project, Syncless, which runs on top of Stackless Python and is dedicated to async code. It looks quite promising.
Ruby 1.9 will get coroutines under the name of Fibers; until then, EventMachine seems the way to go. It looks like a super clean Twisted to me. I hope I’m not hurting too many feelings by saying that. (BTW, there is a great article from the SuperFeedr guys: Ruby Fibers may confuse.)
One very interesting, and very young, project is node.js: basically server-side JavaScript. It is asynchronous by design, and totally awesome if you love JavaScript, or straight out of hell if you don’t. I do like it.
With the web evolving the way it does, where data comes from multiple sources and heavy or special tasks are delegated to specialized units (SOA), I clearly see this asynchronous idea as an ongoing paradigm shift. Today’s bottleneck is I/O, most often the database.
Don’t forget that back-end performance only aims to serve more people, faster, while 80–90% of the time is spent on the client side. I’m a front-end guy after all.