install spark kernel and pyspark kernel using toree

Jupyter Notebook (formerly known as IPython Notebook) is an interactive notebook environment which supports various programming languages which allows you to interact with your data, combine code with markdown text and perform simple visualizations.

Here are just a couple of reasons why using Jupyter Notebook with Spark is the best choice for users that wish to present their work to other team members or to the public in general:

  • Jupyter notebooks support tab autocompletion on class names, functions, methods and variables.
  • It offers more explicit and colour-highlighted error messages than the command line IPython console.
  • Jupyter notebook integrates with many common GUI modules like PyQt, PyGTK, tkinter and with a wide variety of data science packages.
  • Through the use of kernels, multiple languages are supported.

Using Jupyter notebook with Apache Spark is sometimes difficult to configure, particularly when dealing with different development environments. Apache Toree is our solution of choice to configure Jupyter notebooks to run with Apache Spark, it really helps simplify the installation and configuration steps. Default Toree installation works with Scala, although Toree does offer support for multiple kernels including PySpark. We will go over both configurations.

Note: 3Blades offers a pre-built Jupyter Notebook image already configured with PySpark. This tutorial is based in part on the work of Uri Laserson.

Apache Toree with Jupyter Notebook

This should be performed on the machine where the Jupyter Notebook will be executed. If it’s not run on a Hadoop node, then the Jupyter Notebook instance should have SSH access to the Hadoop node.

This guide is based on:

  • IPython 5.1.0
  • Jupyter 4.1.1
  • Apache Spark 2.0.0

The difference between ‘IPython’ and ‘Jupyter’ can be confusing. Basically, the Jupyter team has renamed ‘IPython Notebook’ as ‘Jupyter Notebook’, however the interactive shell is still known as ‘IPython’. Jupyter Notebook ships with IPython out of the box and as such IPython provides a native kernel spec for Jupyter Notebooks.

In this case, we are adding a new kernel spec, known as PySpark.

IPython, Toree and Jupyter Notebook

1) We recommended running Jupyter Notebooks within a virtual environment. This avoids breaking things on your host system. Assuming python / python3 are installed on your local environment, run:

$ pip install virtualenv

2) Then create a virtualenv folder and activate a session, like so:

$ virtualenv -p python3 env3
$ source env3/bin/activate

3) Install Apache Spark and set environment variable for SPARK_HOME. (The official Spark site has options to install bootstrapped versions of Spark for testing.) We recommend downloading the latest version, which as of this writing is Spark version 2.0.0 with Hadoop 2.7.

$ export SPARK_HOME="/path_to_downloaded_spark/spark-2.0.0-bin-hadoop2.7/bin"

4) (Optional)Add additional ENV variables to your bash profile, or set them manually with exports:

$ echo "PATH=$PATH:/path_to_downloaded_spark/spark-2.0.0-bin-hadoop2.7/bin" >> .bash_profile
$ echo "SPARK_HOME=/path_to_downloaded_spark/spark-2.0.0-bin-hadoop2.7" >> .bash_profile

  • Note: substitute .bash_profile for .bashrc or .profile if using Linux.

4) Install Jupyter Notebook, which will also confirm and install needed IPython dependencies:

$ pip install jupyter

5) Install Apache Toree:

$ pip install toree (may fail because: )

6) Configure Apache Toree installation with Jupyter:

$ jupyter toree install --spark_home=$SPARK_HOME

7) Confirm installation:

$ jupyter kernelspec list
Available kernels:
python3 /Users/myuser/Library/Jupyter/kernels/python3
apache_toree_scala /usr/local/share/jupyter/kernels/apache_toree_scala

Launch Jupyter Notebook:

$ jupyter notebook