How to...

Install Apache Spark

These steps are also available as the Ansible playbook apache-spark-master.yml.


Prerequisites

  • Raspbian is installed on the master and on all the slave-nodes
  • The master and the slave-nodes have static IP addresses
  • A computer with an internet connection is available
  • Ansible is installed on the master-node

Download Apache Spark

Go to this website:

Choose the latest stable release of Apache Spark. At the time of writing this was version 1.6.2.

For the package type, choose “Pre-built for Hadoop 2.x”.

This will download a file named something like “spark-1.6.2-bin-hadoop2.6.tgz”.

Use scp to copy this file to the master node:

scp spark-1.6.2-bin-hadoop2.6.tgz [email protected]

Extract the contents of the package with:

tar xvfz spark-1.6.2-bin-hadoop2.6.tgz

This will create a new directory named “spark-1.6.2-bin-hadoop2.6”

Rename this directory to “spark”:

mv spark-1.6.2-bin-hadoop2.6 spark
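The extract and rename steps can also be combined into one small shell function, so the version string is typed only once and cannot drift between commands. This is only a sketch: the function name unpack_spark is hypothetical, and it assumes the archive unpacks into a directory named after the tarball, as described above.

```shell
#!/usr/bin/env bash
# Sketch: extract a Spark tarball in the current directory and rename
# the extracted directory. unpack_spark is a hypothetical helper name.
unpack_spark() {
    local tarball="$1"                 # e.g. spark-1.6.2-bin-hadoop2.6.tgz
    local target="${2:-spark}"         # final directory name, "spark" by default
    local dir
    dir="$(basename "$tarball" .tgz)"  # assumed name of the directory inside the archive

    if [ -e "$target" ]; then
        echo "$target already exists, refusing to overwrite" >&2
        return 1
    fi
    tar xzf "$tarball" || return 1
    mv "$dir" "$target"
}
```

Run it from the home directory on the master-node, for example: unpack_spark spark-1.6.2-bin-hadoop2.6.tgz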

While still on the master-node you need to edit two files:

In ~/spark/conf/slaves we define which IPs belong to the spark-cluster. This file only needs to exist on the master but it doesn’t hurt if it is also on the slave-nodes.

Content of ~/spark/conf/slaves:
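Spark's standalone scripts read this file as one worker host per line. As a sketch of that format only, the helper below writes such a file from a list of addresses. The function name write_slaves is hypothetical, and the addresses in the example are placeholders from the documentation range 192.0.2.0/24, not the addresses of this cluster.

```shell
#!/usr/bin/env bash
# Sketch: write one worker address per line into a slaves file.
# write_slaves is a hypothetical helper name.
write_slaves() {
    local file="$1"
    shift
    if [ "$#" -eq 0 ]; then
        echo "usage: write_slaves FILE HOST..." >&2
        return 1
    fi
    printf '%s\n' "$@" > "$file"
}

# Example with placeholder addresses:
# write_slaves ~/spark/conf/slaves 192.0.2.11 192.0.2.12
```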

In ~/spark/conf/ we define the properties of our cluster (like the amount of RAM available to each node).

Content of ~/spark/conf/ on the master-node:

#!/usr/bin/env bash

# SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# Total amount of memory that can be used on one machine
# SPARK_DAEMON_MEMORY, to allocate to the master, worker and history server themselves (default: 1g)

# Although these are settings for running the cluster with the YARN resource manager
# these need to be set or otherwise the master throws an out-of-memory exception

Copy the spark-directory to each node using Ansible (this may take a while because Ansible's copy module is slow for directories with many files):

ansible raspifarm-slaves -m copy -a "src=/home/farmer/spark dest=/home/farmer/" -f 8

IMPORTANT: The spark-directory must be located at the same path on the master and on the slave-nodes!

Make all files inside the spark-folder executable.

On the master-node:

chmod -R +x spark

On the slave-nodes (automated with ansible):

ansible raspifarm-slaves -a "chmod -R +x spark" -f 8
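To check that the chmod step worked, one option is to list every regular file that still lacks the execute bit for its owner. The function name list_non_executable is hypothetical; this is a minimal sketch that relies only on find.

```shell
#!/usr/bin/env bash
# Sketch: print regular files under a directory that are missing the
# owner-execute bit. list_non_executable is a hypothetical helper name.
list_non_executable() {
    find "$1" -type f ! -perm -u+x
}

# Example: list_non_executable ~/spark
```

If the command prints nothing, all files below the given directory are executable for their owner.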

Adjust log-level for spark

On the master-node:

Open ~/spark/conf/ with your favourite text-editor and change the following line:

log4j.rootCategory=INFO, console

to:

log4j.rootCategory=WARN, console
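The same change can be made without opening an editor. The sketch below takes the path of the properties file as its argument; the function name set_log_level_warn is hypothetical, and the -i option assumes GNU sed, which is what Raspbian ships.

```shell
#!/usr/bin/env bash
# Sketch: switch the root log level from INFO to WARN in a log4j
# properties file. set_log_level_warn is a hypothetical helper name.
# Assumes GNU sed (in-place editing with -i and no backup suffix).
set_log_level_warn() {
    local file="$1"
    sed -i 's/^log4j\.rootCategory=INFO, console$/log4j.rootCategory=WARN, console/' "$file"
}
```

Lines other than the exact INFO line are left untouched, so running the function twice is harmless.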

Starting the spark-cluster manually

Before starting the spark-cluster we need to set two environment variables:

export PATH=$PATH:/home/farmer/spark/bin/

export SPARK_HOME=/home/farmer/spark/
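These exports only last for the current shell session. One way to make them permanent is to append them to a shell startup file, taking care not to add the same line twice. The sketch below assumes the login shell is bash and that ~/.bashrc is the file to use; the function name append_once is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: append a line to a file only if that exact line is not
# already present. append_once is a hypothetical helper name.
append_once() {
    local line="$1" file="$2"
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example (assumes bash and ~/.bashrc):
# append_once 'export PATH=$PATH:/home/farmer/spark/bin/' ~/.bashrc
# append_once 'export SPARK_HOME=/home/farmer/spark/' ~/.bashrc
```

The single quotes keep $PATH unexpanded, so the variable is evaluated each time a new shell starts.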

Start the cluster from the master-node:


Using the cluster via shell


Python shell:

~/spark/bin/pyspark --master spark://


Scala shell:

~/spark/bin/spark-shell --master spark://
