Create a custom Jupyter kernel for PySpark#

These instructions add a custom Jupyter Notebook option to allow users to select PySpark as the kernel.

Install Spark#

The easiest way to install Spark is with Cloudera CDH.

You will use YARN as the resource manager. After installing Cloudera CDH, install Spark; Spark includes the PySpark shell.
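To confirm the installation, you can check for the Spark install location and the bundled PySpark shell. The commands below are a sketch that assumes the default CDH path /usr/lib/spark (the same path used later in these instructions); adjust it if Spark is installed elsewhere.

# verify the Spark installation and the bundled PySpark shell
ls /usr/lib/spark/bin/pyspark
/usr/lib/spark/bin/spark-submit --version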

Create a notebook kernel for PySpark#

You may create the kernel as an administrator or as a regular user. Read the instructions below to help you choose which method to use.

1. As an administrator#

Create a new kernel and point it to the root environment in each project. To do so, create a directory named ‘pyspark’ in /opt/wakari/wakari-compute/share/jupyter/kernels/.
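For example, the directory can be created from a shell by an account with write access to that path (a sketch; in many installs this is root or the AEN service account):

sudo mkdir -p /opt/wakari/wakari-compute/share/jupyter/kernels/pyspark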

In that directory, create the following kernel.json file:

{"argv": ["/opt/wakari/anaconda/bin/python",
 "-m", "ipykernel", "-f", "connection_file}", "--profile", "pyspark"],
 "display_name":"PySpark",  "language":"python" }

You may choose any name for the ‘display_name’.

This configuration points to the Python executable in the root environment. Because that environment is under admin control, users cannot add new packages to it; they will need an administrator to update it for them.

2. As an administrator without an IPython profile#

To have an admin-level PySpark kernel that does not depend on the user’s .ipython space, use the following kernel.json file instead:

{"argv":
["/opt/wakari/wakari-compute/etc/ipython/pyspark.sh", "-f", "{connection_file}"],
"display_name":"PySpark",  "language":"python" }

NOTE: The pyspark.sh script is defined in the Without an IPython profile section below.

3. As a regular user#

Create a new directory in the user’s home directory: .local/share/jupyter/kernels/pyspark/. This way the user uses the project’s default environment and can upgrade or install new packages.
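For example, from a terminal in the user’s session:

mkdir -p ~/.local/share/jupyter/kernels/pyspark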

In that directory, create the following kernel.json file:

{"argv": ["/projects/<username>/<project_name>/envs/default/bin/python",
 "-m", "ipykernel", "-f", "connection_file}", "--profile", "pyspark"],
 "display_name":"PySpark",  "language":"python" }

NOTE: Replace “<username>” with the correct user name and “<project_name>” with the correct project name.

You may choose any name for the ‘display_name’.
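Whichever method you use, you can check that Jupyter picks up the new kernel spec with the jupyter kernelspec list command (a sketch; it assumes the Anaconda jupyter is on the PATH and that the kernel directory above is on Jupyter’s kernel search path):

# the new ‘pyspark’ kernel should appear in the output
jupyter kernelspec list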

Create an IPython profile#

The --profile argument in the kernels above requires that a pyspark IPython profile be defined. This profile should be created for each user who logs in to AEN to use the PySpark kernel.

In the user’s home directory, create the directory ~/.ipython/profile_pyspark/startup/ and, inside it, the file 00-pyspark-setup.py with the following contents:

import os
import sys

# This is where CDH installs Spark; if Spark was installed locally, change the path here.
# Optionally, this value could be read from an environment variable instead of being hard-coded.

os.environ["SPARK_HOME"] = "/usr/lib/spark"

os.environ["PYSPARK_PYTHON"] = "/opt/wakari/anaconda/bin/python"

# Add the Spark Python libraries to the Python path
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.9-src.zip")  # adjust the py4j version to match your Spark install (e.g. py4j-0.10.4-src.zip)
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")

os.environ["PYSPARK_SUBMIT_ARGS"] = "--name yarn pyspark-shell"

Now log in using the user account that has the PySpark profile.

Without an IPython profile#

If you need to avoid creating a local IPython profile for each user, the kernel can instead call a script that loads the environment variables. Create an executable bash script:

sudo -u $AEN_SRVC_ACCT mkdir /opt/wakari/wakari-compute/etc/ipython
sudo -u $AEN_SRVC_ACCT touch /opt/wakari/wakari-compute/etc/ipython/pyspark.sh
sudo -u $AEN_SRVC_ACCT chmod a+x /opt/wakari/wakari-compute/etc/ipython/pyspark.sh

The contents of the file should look like:

#!/usr/bin/env bash
# set up environment variables, etc.

export PYSPARK_PYTHON="/opt/wakari/anaconda/bin/python"
export SPARK_HOME="/usr/lib/spark"

# Add the Spark Python libraries to PYTHONPATH
export PYLIB=$SPARK_HOME/python/lib
export PYTHONPATH=$PYTHONPATH:$PYLIB/py4j-0.9-src.zip
export PYTHONPATH=$PYTHONPATH:$PYLIB/pyspark.zip

export PYSPARK_SUBMIT_ARGS="--name yarn pyspark-shell"

# run the ipykernel
exec /opt/wakari/anaconda/bin/python -m ipykernel "$@"

Using PySpark#

When creating a new notebook in a project, there will now be the option to select PySpark as the kernel. In such a notebook you can import pyspark and start using it:

from pyspark import SparkConf
from pyspark import SparkContext

NOTE: You can always add these lines, and any other commands you use frequently, to the PySpark setup file 00-pyspark-setup.py shown above.