Using Anaconda with Spark

Apache Spark is an analytics engine and parallel computation framework with Scala, Python and R interfaces. Spark can load data from local disk and in-memory sources, as well as from storage systems such as Amazon S3, the Hadoop Distributed File System (HDFS), HBase, and Cassandra.

Anaconda Scale can be used with a cluster that already has a managed Spark/Hadoop stack. Anaconda Scale can be installed alongside existing enterprise Hadoop distributions such as Cloudera CDH or Hortonworks HDP and can be used to manage Python and R conda packages and environments across a cluster.
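As an illustration, package and environment management is typically done from a client machine with the Anaconda Scale command-line tool. The sketch below assumes an `acluster` CLI as described in the Anaconda Scale documentation; the environment name and package list are placeholders, and the exact commands should be verified against your installed version:

```shell
# Create a conda environment on every node of the cluster
# (environment name and packages are placeholders)
acluster conda create -n analytics python=3.5 numpy pandas

# Install an additional conda package into that environment across the cluster
acluster conda install -n analytics scikit-learn
```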

To run a script on the head node, execute it with python directly on that node. Alternatively, you can install Jupyter Notebook on the cluster using Anaconda Scale. See the Installation documentation for more information.
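A script run this way is ordinary PySpark code. The sketch below assumes PySpark is installed and configured on the head node; the application name and HDFS path are placeholders. It counts word occurrences in a text file stored in HDFS:

```python
# Minimal PySpark word-count script; run on the head node with:
#   python word_count.py
# The HDFS path below is a placeholder.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("word-count-example")
sc = SparkContext(conf=conf)

lines = sc.textFile("hdfs:///tmp/sample.txt")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# Print a small sample of the results
for word, count in counts.take(10):
    print(word, count)

sc.stop()
```

This requires a working Spark installation on the node where it runs; it does not execute standalone.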

Different ways to use Spark with Anaconda

You can develop Spark scripts interactively, writing them either as standalone Python scripts or in a Jupyter Notebook.

You can submit a PySpark script to a Spark cluster using various methods:

  • Run the script directly on the head node with the python command.
  • Use the spark-submit command either in Standalone mode or with the YARN resource manager.
  • Submit the script interactively in an IPython shell or Jupyter Notebook on the cluster. For information on using Anaconda Scale to install Jupyter Notebook on the cluster, see Installation.
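As a sketch of the spark-submit forms mentioned above, the commands below use standard spark-submit options; the script name and master host are placeholders:

```shell
# Standalone mode: point --master at the Spark master's URL
spark-submit --master spark://head-node:7077 my_script.py

# YARN mode: let the YARN resource manager schedule the job
spark-submit --master yarn --deploy-mode cluster my_script.py
```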

Using Anaconda Scale with Spark

The topics listed below describe how to:

  • Use Anaconda and Anaconda Scale with Apache Spark and PySpark
  • Interact with data stored within the Hadoop Distributed File System (HDFS) on the cluster

While these tasks are independent and can be performed in any order, we recommend that you begin with Configuring Anaconda with Spark.