Building a Raspberry Pi Python cluster with IPython
In a previous article, we looked at how to combine multiple Raspberry Pis together to make a cluster of machines. But how can you make use of this small monster that you have created?
Luckily, IPython is a great way to put all of these machines to work and get some serious computing done. IPython started out as an enhanced interface for Python, replacing the usual interactive interpreter. The initial design involved creating a client/server system for IPython, which quickly allowed for some very interesting functionality.
One of these extra capabilities is the ability to run multiple IPython servers that a single IPython client can connect to. You can then connect to these IPython servers and run processes in parallel.
The first step is to install IPython on your Raspberry Pis. The following command will take care of that for you, and will need to be run on each node of your Pi cluster:
sudo apt-get install ipython ipython-notebook
This will install the version of IPython which uses Python 2.X. If you are planning on using Python 3.X, you will need to install the packages ipython3 and ipython3-notebook instead. Starting with version 4.0 of IPython, the parallel functionality has been pulled out into its own package called ipyparallel. Most distributions do not have it available within their package management systems. This means that you will need to install it with pip, as shown:
sudo pip install ipyparallel
Depending on the versions of the dependencies on your system, pip may update some of these packages.
ipyparallel is structured in four parts: the engine, the hub, the scheduler and the client. The engines do the actual running of the code from your program. The client is where your code runs and where the parallel calls are made. The hub and scheduler are the parts that let the client and the engines communicate with each other.
Once everything is installed, you can start up the various parts of IPython's parallel machinery. To start a set of engines on your Raspberry Pi, use the ipcluster command:
ipcluster start -n 4
This will start up a controller and a set of four engines. An IPython controller is made up of one hub and a series of schedulers. When you run this command, it stays attached to the starting shell process so that it can write out messages from the controller. If you want to start the cluster and get your prompt back, put an ampersand (&) at the end of the command so that it runs in the background. As a first test, you can create a client object and connect to these engines with code like this:
>>> import ipyparallel as ipp
>>> c = ipp.Client()
>>> c.ids
[0, 1, 2, 3]
>>> c[:].apply_sync(lambda: 'Hello World')
['Hello World', 'Hello World', 'Hello World', 'Hello World']
When you instantiate a new Client object with ipp.Client(), with no input parameters, it will look in the directory ~/.ipython/profile_default/security to see where to connect to in order to run your parallel code.
The next statement gives you a list of the IDs for the available engines. The final statement uses the method apply_sync to run the given lambda function on some subset of the available engines.
In the previous example, we actually ran the code on the entire list of available engines. This all works fine on a single Raspberry Pi, but that is not what we are interested in. How do you use the entire cluster that you have set up and that is waiting to work for you? There are a few options for organising the different processes so that they can all communicate with each other. The simplest is to keep the controller and the engines together on the same machine and use ipcluster to run everything. By default, ipcluster sets things up so that only local connections will be accepted. In order to accept connections from external machines, you will want to use IPython to create a new profile, as shown:
ipython profile create --parallel --profile=myprofile
This will create a new profile directory in ~/.ipython/profile_myprofile with a set of configuration files. In order to accept incoming connections, you need to tell the controller what interface to listen on. To do that, edit the file ipcontroller_config.py in the profile directory and set the ip option of the HubFactory there.
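The configuration line itself does not appear in this copy of the article. As a hedged sketch only: in ipyparallel releases that still use HubFactory, listening on every interface is usually written as c.HubFactory.ip = '*'. That value is an assumption here, and a specific address can be substituted. The helper below, whose name enable_remote_connections is made up for illustration, appends the line to an existing config file unless the option is already set.

```python
from pathlib import Path

# Assumed setting: '*' asks the hub to listen on every interface.
# The article's own line is not shown, so treat this value as a guess.
HUB_IP_LINE = "c.HubFactory.ip = '*'"


def enable_remote_connections(config_file):
    """Append the HubFactory ip setting to config_file unless it is already set."""
    path = Path(config_file)
    text = path.read_text()
    # Leave the file alone if an uncommented HubFactory.ip line already exists.
    for line in text.splitlines():
        if line.strip().startswith("c.HubFactory.ip"):
            return False
    if text and not text.endswith("\n"):
        text += "\n"
    path.write_text(text + HUB_IP_LINE + "\n")
    return True
```

On the controller Pi this could be called as enable_remote_connections(Path.home() / ".ipython/profile_myprofile/ipcontroller_config.py"). Editing the file by hand works just as well.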
Now when you start your cluster, you can hand in a profile name to use the altered configuration options, as shown.
ipcluster start --profile=myprofile
Within your code, you need to tell the client where the controller is running. If the client machine and the controller machine share a filesystem (over NFS, for example), you can use the profile parameter when you instantiate a new client object:
>>> c = ipp.Client(profile='myprofile')
If they don't share a filesystem, you need to configure the client before making a connection. On the Raspberry Pi that is hosting the controller, you will find a file named ipcontroller-client.json in the directory ~/.ipython/profile_myprofile/security, which will contain all of the connection details that the client will need.
If you copy it to the client machine that will run your code and place it in the directory ~/.ipython/profile_default/security, you can again instantiate new client objects without any parameters and it will read the connection details from this JSON file. We are still only using a single Raspberry Pi to run engines on.
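Placing the client connection file in the right directory can be scripted. The following is a minimal sketch, assuming Python 3 and that the JSON file has already been transferred to the client machine by some other means. The function name install_connection_file and its ipython_dir parameter are illustrative and not part of IPython.

```python
import shutil
from pathlib import Path


def install_connection_file(json_file, profile="default", ipython_dir=None):
    """Copy a controller connection file into a profile's security directory."""
    # Default to ~/.ipython, the location used throughout this article.
    base = Path(ipython_dir) if ipython_dir else Path.home() / ".ipython"
    security_dir = base / "profile_{}".format(profile) / "security"
    security_dir.mkdir(parents=True, exist_ok=True)
    target = security_dir / Path(json_file).name
    shutil.copy(str(json_file), str(target))
    return target
```

Calling install_connection_file("ipcontroller-client.json") from the directory holding the copied file would place it under ~/.ipython/profile_default/security.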
What about the rest of your cluster? In order to pull these in, you will need to use the individual ipcontroller and ipengine commands, rather than the all-in-one ipcluster command. Since the controller needs to accept connections from clients and engines that are running on remote machines, we will continue to use the IP configuration option that we set earlier. You can then just run the command ipcontroller to start it up.
The next step is to run the ipengine command on each of the Raspberry Pis that are going to form the pool of computational engines. In the directory ~/.ipython/profile_myprofile/security you will find a second file named ipcontroller-engine.json. This file needs to be copied to each of the Raspberry Pis within your cluster so that they will have all of the connection details necessary to communicate with the controller. On each of the Pis, you can start the ipengine process, pointing it to the ipcontroller-engine.json file that you just copied over.
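The engine start-up step can be sketched in Python as follows. This assumes ipengine is on the PATH and uses its --file option to name the connection file. The helper names and the count parameter are made up for illustration, and the path in the usage note is hypothetical.

```python
import subprocess


def ipengine_command(connection_file):
    """Build the command line for one engine that reads the given connection file."""
    # --file tells ipengine which connection file to read.
    return ["ipengine", "--file={}".format(connection_file)]


def start_engines(connection_file, count=1):
    """Start `count` separate ipengine processes and return their Popen handles."""
    return [subprocess.Popen(ipengine_command(connection_file))
            for _ in range(count)]
```

For example, start_engines("/home/pi/ipcontroller-engine.json", count=4) would launch four engine processes on one Pi, one process per engine.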
If you want each Pi to host multiple engines, you will need to start a separate ipengine process for each one.
So, now you have IPython engines running on a cluster of Raspberry Pis, with one of them hosting a controller. Your Python code can connect to this cluster to parallelise your work, but how do you use all of this power? One of the first steps is to grab a view of some, or all, of the available IPython engines to work with. This is done using the same syntax as that used when slicing lists.
Once you have a view, you can use one of the apply methods to get a function to run on the selected engines:
>>> v = c[0:2]
>>> v.apply_sync(lambda : 42)
[42, 42]
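Because views use list-slicing syntax, the engines a view will target can be predicted by slicing the list of engine IDs in the same way. A small illustration in plain Python, assuming a cluster that reports four engines:

```python
# Engine IDs as a client connected to four engines would report them.
engine_ids = [0, 1, 2, 3]

# A view built from a slice targets the same IDs a list slice returns.
first_two = engine_ids[0:2]   # c[0:2] -> engines 0 and 1
last_two = engine_ids[2:]     # c[2:]  -> engines 2 and 3
everyone = engine_ids[:]      # c[:]   -> every engine

print(first_two, last_two, everyone)
```

An apply call on a view returns one result per targeted engine, so a view over two engines gives back a list of two values.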
This is a blocking version of the apply method. If the function handed in is going to take time, you can use the method apply_async(), which will run your function on the engine pool asynchronously and return an AsyncResult object that you can use to retrieve the results once they’re finished. If you have data that needs to be processed, you can use the map method to distribute the data to the IPython engines and run the given function on each chunk.
>>> v.map_sync(lambda x : x*2, range(10))
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
As with the apply methods, there is an asynchronous version of the map method, too.
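The asynchronous calls can be sketched as below. This is a minimal example rather than a definitive pattern: slow_square is a hypothetical stand-in for real work, and view is assumed to be a view obtained from a connected client, such as c[:].

```python
def slow_square(x):
    """Stand-in for an expensive computation (hypothetical example function)."""
    import time  # imported here so the function is self-contained on the engines
    time.sleep(0.1)
    return x * x


def run_async(view, values):
    """Run slow_square over values without blocking, then collect the results."""
    # map_async returns immediately with an AsyncResult-style object.
    result = view.map_async(slow_square, values)
    # The client is free to do other work here; ready() reports completion.
    while not result.ready():
        result.wait(1)  # wait up to one second before checking again
    # get() returns the results in the same order as the inputs.
    return result.get()
```

With a running cluster, run_async(c[:], range(10)) would return the ten squares once every engine has finished. The same ready(), wait() and get() calls apply to the object returned by apply_async().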
You can also move data around with methods from the view object. If all of the engines need a complete set of the data being used, you can use the push method to move data out to the engines, and pull to read the data off the engines again. If you want to use the dictionary interface to verify the data transfer, run the following code:
>>> dict1 = dict(a='foo', b='bar')
>>> v2 = c[2:]
>>> v2.push(dict1, block=True)  # copy a and b to every engine in the view
>>> v2['a']  # dictionary-style pull of a from each engine
You can use scatter and gather to partition your data giving each engine a chunk.
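To make the idea of partitioning concrete, here is a plain-Python sketch of the kind of split that scatter performs and the reassembly that gather performs. It is an illustration only: the helper names partition and join are made up, and ipyparallel's exact chunk boundaries may differ.

```python
def partition(seq, n_engines):
    """Split seq into n_engines contiguous chunks whose sizes differ by at most one."""
    seq = list(seq)
    base, extra = divmod(len(seq), n_engines)
    chunks, start = [], 0
    for i in range(n_engines):
        # The first `extra` engines each take one additional item.
        size = base + (1 if i < extra else 0)
        chunks.append(seq[start:start + size])
        start += size
    return chunks


def join(chunks):
    """Reassemble the chunks in engine order, as a gather would."""
    return [item for chunk in chunks for item in chunk]


print(partition(range(10), 4))        # [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
print(join(partition(range(10), 4)))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

On a real view the equivalent calls would be along the lines of v.scatter('data', range(10), block=True) followed by v.gather('data', block=True), where 'data' is the variable name created on each engine.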
Hopefully, this article has inspired you to dig out all of those Raspberry Pis that have been sitting around collecting dust and create your own cluster for big-time calculations. The nice thing about using IPython is that you can easily add or remove machines to the pool that is available for running Python engines.
To learn more, we recommend the IPython Interactive Computing and Visualization Cookbook.