Cloudera provides Apache Hadoop-based software, support and services, as well as training to business customers. Their open source Apache Hadoop distribution, CDH (Cloudera Distribution Including Apache Hadoop), targets enterprise-class deployments of that technology.
There are two ways to use Anaconda on an existing Cloudera CDH cluster:
- Use the Anaconda parcel for Cloudera CDH. The following procedure describes how to install the Anaconda parcel on a CDH cluster using Cloudera Manager. The Anaconda parcel provides a static installation of Anaconda, based on Python 2.7, that can be used with Python and PySpark jobs on the cluster.
- Use Anaconda Scale, which provides additional functionality, such as the ability to manage multiple conda environments and packages, including Python and R, alongside an existing CDH cluster. For more information, see Using Anaconda with Cloudera CDH.
See the blog post Self-service Open Data Science: Custom Anaconda parcels for Cloudera.
To install the Anaconda parcel:
In the Cloudera Manager Admin Console, in the top navigation bar, click the Parcels icon.
At the top right of the parcels page, click the Edit Settings button.
In the Remote Parcel Repository URLs section, click the plus symbol, and then add the following repository URL for the Anaconda parcel:
At the top of the page, click the Save Changes button.
In the top navigation bar, click the Parcels icon to return to the list of available parcels, where you should see the latest version of the Anaconda parcel that is available.
To the right of the Anaconda parcel listing, click the Download button.
After the parcel is downloaded, click the Distribute button to distribute the parcel to all of the cluster nodes.
After the parcel is distributed, click the Activate button to activate the parcel on all of the cluster nodes.
When prompted, confirm the activation.
After the parcel is activated, Anaconda is available on all of the cluster nodes.
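One way to confirm the activation on a given node is to check that the parcel's Python interpreter exists and is executable. The following is a minimal sketch, assuming the parcel is activated under /opt/cloudera/parcels/Anaconda, the path used in the PySpark example in this section. The function name anaconda_python and its parcel_root parameter are illustrative only and are not part of any Anaconda or Cloudera tool.

```python
import os


def anaconda_python(parcel_root="/opt/cloudera/parcels/Anaconda"):
    """Return the parcel's Python interpreter path, or None if it is not usable.

    parcel_root is an assumed default; adjust it if your parcels live elsewhere.
    """
    candidate = os.path.join(parcel_root, "bin", "python")
    # The interpreter must exist as a file and be executable by the current user.
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    return None


if __name__ == "__main__":
    path = anaconda_python()
    print(path if path else "Anaconda parcel interpreter not found on this node")
```

Running this on each node, for example over SSH, gives a quick check that is independent of Cloudera Manager's own status display.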
You can submit Spark jobs along with the PYSPARK_PYTHON environment variable, which refers to the location of the Anaconda Python interpreter. For example, enter the following command all on one line:
PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python spark-submit pyspark_script.py
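The same submission can also be scripted. The sketch below assumes that spark-submit is on the PATH and that the parcel is at the default location used in the command above. It sets PYSPARK_PYTHON in a copy of the current environment and builds the argument list. The names build_submit and run_submit are hypothetical helpers, not part of Spark or Anaconda.

```python
import os
import subprocess

# Assumed parcel interpreter location, taken from the command above.
ANACONDA_PYTHON = "/opt/cloudera/parcels/Anaconda/bin/python"


def build_submit(script, python_path=ANACONDA_PYTHON, base_env=None):
    """Return (argv, env) for a spark-submit call that uses the parcel's Python."""
    # Copy the environment so the caller's process environment is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["PYSPARK_PYTHON"] = python_path
    return ["spark-submit", script], env


def run_submit(script):
    """Run spark-submit and return its exit code (requires spark-submit on PATH)."""
    argv, env = build_submit(script)
    return subprocess.run(argv, env=env).returncode
```

Calling run_submit("pyspark_script.py") from a node where spark-submit is available should then behave like the one-line command above.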
The repository URL shown above installs the most recent version of the Anaconda parcel. To install an older version of the Anaconda parcel, add the following repository URL to the Remote Parcel Repository URLs in Cloudera Manager, and then follow the above steps with your desired version of the Anaconda parcel.
Anaconda builds new Cloudera parcels at least once a year, each spring, and also offers custom parcel creation for enterprise customers. The Anaconda parcel provided at the repository URL shown above is based on Python 2.7. To use the Anaconda parcel with other versions of Python or with additional packages, contact firstname.lastname@example.org for more information about custom Anaconda parcel builds or other enterprise solutions for using Anaconda with cluster computing.
Anaconda Workgroup and Anaconda Enterprise subscribers can also use Anaconda Repository to create and distribute their own custom Anaconda parcels for Cloudera Manager.
For more information about managing Cloudera parcels, see the Cloudera documentation.