Create custom Jupyter kernel for PySpark (AEN 4.1.3)

These instructions add a custom kernel to Jupyter Notebook so that users can select PySpark as the kernel when creating a notebook.

Install Spark

The easiest way to install Spark is with Cloudera CDH.

You will use YARN as the resource manager. After installing Cloudera CDH, install Spark; Spark ships with a PySpark shell.

Create a notebook kernel for PySpark

You may create the kernel as an administrator or as a regular user. Read the instructions below to help you choose which method to use.

1. As an administrator

Create a new kernel and point it to the root environment in each project. To do so, create a directory named pyspark in /opt/wakari/wakari-compute/share/jupyter/kernels/.

Create the following kernel.json file:

{"argv": ["/opt/wakari/anaconda/bin/python",
 "-m", "ipykernel", "-f", "connection_file}", "--profile", "pyspark"],
 "display_name":"PySpark",  "language":"python" }

You may choose any name for the ‘display_name’.

This configuration points to the Python executable in the root environment. Because that environment is under admin control, users cannot add new packages to it; they will need an admin to update the environment for them.
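
To confirm which interpreter a notebook started from this kernel is actually using, a quick check in a notebook cell is enough; with the admin-managed kernel above it should report the root environment:

import sys

# For the admin-managed kernel this should print /opt/wakari/anaconda/bin/python.
print(sys.executable)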

2. As a regular user

Create a new directory in the user’s home directory: .local/share/jupyter/kernels/pyspark/. This way the user uses the project’s default environment and can upgrade or install new packages.

Create the following kernel.json file:

{"argv": ["/projects/<username>/<project_name>/envs/default/bin/python",
 "-m", "ipykernel", "-f", "connection_file}", "--profile", "pyspark"],
 "display_name":"PySpark",  "language":"python" }

NOTE: Replace “<username>” with the correct user name and “<project_name>” with the correct project name.

You may choose any name for the ‘display_name’.
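
To verify that Jupyter can see the new kernel spec, one option is to list the registered kernel specs from Python using jupyter_client, which is installed alongside the notebook server. This is just a quick check; note that a user-level kernel in ~/.local/share/jupyter/kernels is always on the search path, while system-wide kernel directories are found only if they are on the Jupyter path of the environment running the check:

from jupyter_client.kernelspec import KernelSpecManager

# Maps kernel names (e.g. 'pyspark') to the directories holding their kernel.json.
print(KernelSpecManager().find_kernel_specs())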

Create an IPython profile

The --profile argument in the kernel configurations above requires a matching PySpark profile. This profile should be created for each user who logs in to AEN to use the PySpark kernel.

In the user’s home directory, create the file ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py, creating any missing intermediate directories, with the following contents:

import os
import sys

# Path where CDH installed Spark; change this if Spark was installed locally.
# Optionally, SPARK_HOME could be read from the environment instead of being hard-coded (see the sketch below).

os.environ["SPARK_HOME"] = "/usr/lib/spark"

os.environ["PYSPARK_PYTHON"] = "/opt/wakari/anaconda/bin/python"

# And Python path
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.9-src.zip")  # adjust to the py4j version shipped with your Spark, e.g. py4j-0.10.4-src.zip
sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")

os.environ["PYSPARK_SUBMIT_ARGS"] = "--name yarn pyspark-shell"

Now log in using the user account that has the PySpark profile.

When creating a new notebook in a project, there will now be an option to select PySpark as the kernel. In such a notebook you can import pyspark and start using it:

from pyspark import SparkConf
from pyspark import SparkContext
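
For example, a minimal smoke test in such a notebook could build a context and run a trivial job; the application name below is arbitrary and only serves as an illustration:

from pyspark import SparkConf
from pyspark import SparkContext

# Build a SparkContext using the submit arguments defined in the profile.
conf = SparkConf().setAppName("pyspark-kernel-test")
sc = SparkContext(conf=conf)

# Trivial job: distribute a small list and sum it on the cluster.
print(sc.parallelize(list(range(100))).sum())

sc.stop()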

NOTE: You can always add these imports, and any other commands you use frequently, to the PySpark setup file 00-pyspark-setup.py shown above.