CDH is normally on a six-month release cycle, while Spark is on a four-month release cycle. This often adds latency before a Spark version is integrated and supported in CDH, and at times Cloudera may even skip a couple of Spark versions. The reason for the delay seems to be the effort needed for integration testing, and for fixing bugs in Spark and other projects, so that everything works together in time.
But if you are like our teams and can't wait to get your hands on the latest and greatest of both Spark and CDH, you can always run the latest version of Spark on CDH yourself. We ran into exactly this recently: one of our clients wanted to use the Spark Thrift Server on CDH 5.5, but the Spark Thrift JDBC/ODBC server is not included in CDH 5.5. We figured this might be a common need for anyone who wants to get on the latest Spark quickly, so we put together the steps to install the latest version of Spark on CDH.
- Download the Spark source for the version that you want in your CDH. You can either get the code from the git repository or download the source of a specific version from the Apache Spark site. Here is how to get the source from git
To get the latest code from master, use the command below
git clone https://github.com/apache/spark.git
To get a specific maintenance branch with stability fixes on top of Spark 1.5
git clone https://github.com/apache/spark.git -b branch-1.5
- cd spark (the directory created by git clone)
- Execute the command below to build the Spark libraries from source
./make-distribution.sh -Phadoop-2.6 -Pyarn -Dhadoop.version=2.6.0-cdh5.5.0 -DskipTests -Phive -Phive-thriftserver
- Spark needs to be built with the -Phive and -Phive-thriftserver options to get the Thrift server as part of the distribution
- Maven needs to be version 3.3 or higher, and it needs enough heap memory (1-2 GB) allotted via the Xmx setting
- The build takes around 15-30 minutes and writes the libraries to the dist folder
spark-examples-1.5.1-hadoop2.6.0-cdh5.5.0.jar
spark-assembly-1.5.1-hadoop2.6.0-cdh5.5.0.jar
spark-1.5.1-cdh5.5.0-yarn-shuffle.jar
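Before moving on, it can help to confirm that the build really produced all three jars. The snippet below is a minimal sketch, not part of the original procedure: the MAVEN_OPTS value is just one assumed setting within the 1-2 GB range mentioned above, and missing_jars is a hypothetical helper name that searches a directory tree for the three jar name patterns.

```shell
# Assumed heap setting within the 1-2 GB range; export it before running the build.
export MAVEN_OPTS="-Xmx2g"

# Hypothetical helper: print every expected jar pattern that has no match
# anywhere under the given directory. Prints nothing if all three are found.
missing_jars() {
  dist_dir="$1"
  for pattern in 'spark-assembly-*.jar' 'spark-examples-*.jar' 'spark-*-yarn-shuffle.jar'; do
    if [ -z "$(find "$dist_dir" -type f -name "$pattern" 2>/dev/null | head -n 1)" ]; then
      echo "$pattern"
    fi
  done
}
```

Running missing_jars dist from the source directory after the build should print nothing; any pattern it prints is a jar that still needs to be built.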
- Stop the Spark service on your cluster.
- Copy the three jar files to all machines in your cluster. I copied them to /opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/jars
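Copying by hand gets tedious on a larger cluster. As a rough sketch, the helper below only prints the copy commands for a list of hosts so they can be reviewed before anything runs. The function name plan_jar_copy, the /tmp staging location, and the use of scp, ssh, and sudo are all assumptions for illustration; adjust them to however you normally reach your nodes.

```shell
# Target directory used in this post; change it to match your parcel version.
JARS_DIR=/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/jars

# Hypothetical dry-run helper: for each host, print (do not run) the commands
# that would stage the three built jars in /tmp and then copy them into JARS_DIR.
plan_jar_copy() {
  src_dir="$1"; shift
  for host in "$@"; do
    for jar in "$src_dir"/spark-assembly-*.jar \
               "$src_dir"/spark-examples-*.jar \
               "$src_dir"/spark-*-yarn-shuffle.jar; do
      [ -e "$jar" ] || continue   # skip patterns that matched nothing
      name=$(basename "$jar")
      echo "scp $jar $host:/tmp/$name"
      echo "ssh -t $host sudo cp /tmp/$name $JARS_DIR/"
    done
  done
}
```

For example, plan_jar_copy /path/to/built/jars node1 node2 (placeholder path and host names) prints two commands per jar per host, which can then be run once they look right.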
- Do this for all the nodes on the cluster
- Log on to each node via shell and run the commands below
- Change directory to /opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/lib
cd /opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/lib
- update the spark-assembly jar
sudo rm spark-assembly.jar
sudo rm spark-assembly-1.5.0-cdh5.5.0-hadoop2.6.0-cdh5.5.0.jar
sudo ln -s ../../../jars/spark-assembly-1.5.1-hadoop2.6.0-cdh5.5.0.jar
sudo ln -s ../../../jars/spark-assembly-1.5.1-hadoop2.6.0-cdh5.5.0.jar spark-assembly.jar
- update the spark examples jar
sudo rm spark-examples-1.5.0-cdh5.5.0-hadoop2.6.0-cdh5.5.0.jar
sudo rm spark-examples.jar
sudo ln -s ../../../jars/spark-examples-1.5.1-hadoop2.6.0-cdh5.5.0.jar
sudo ln -s ../../../jars/spark-examples-1.5.1-hadoop2.6.0-cdh5.5.0.jar spark-examples.jar
- update yarn-shuffle jar
cd /opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/hadoop-yarn/lib
sudo rm spark-1.5.0-cdh5.5.0-yarn-shuffle.jar
sudo rm spark-yarn-shuffle.jar
sudo ln -s ../../../jars/spark-1.5.1-cdh5.5.0-yarn-shuffle.jar
sudo ln -s ../../../jars/spark-1.5.1-cdh5.5.0-yarn-shuffle.jar spark-yarn-shuffle.jar
- update the examples jar link under spark/examples/lib
cd /opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/examples/lib
sudo rm spark-examples-1.5.0-cdh5.5.0-hadoop2.6.0-cdh5.5.0.jar
sudo ln -s ../../lib/spark-examples-1.5.1-hadoop2.6.0-cdh5.5.0.jar
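The rm and ln -s pairs above can also be collapsed into a single step. Here is a minimal sketch, assuming a find and ln that support the common -f and -n options (GNU and BSD both do); relink is a hypothetical helper name, and on the cluster it would have to run with sudo because the parcel directories are owned by root.

```shell
# Hypothetical helper: point link_name at target, replacing any existing
# link or file of that name in one step.
#   -s  create a symbolic link
#   -f  remove an existing destination first
#   -n  if link_name is itself a symlink to a directory, replace it
#       instead of creating the new link inside that directory
relink() {
  target="$1"
  link_name="$2"
  ln -sfn "$target" "$link_name"
}
```

For example, relink ../../../jars/spark-assembly-1.5.1-hadoop2.6.0-cdh5.5.0.jar spark-assembly.jar has the same effect as removing spark-assembly.jar and recreating the link.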
- List the symlinks in each of these folders (for example with ls -l) and make sure none of them are broken. In a color terminal a broken symlink typically shows up blinking or in red, which means it is pointing to the wrong location
- Restart the Spark service from Cloudera Manager
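Spotting a broken link by eye is easy to get wrong across many nodes, so here is a small sketch that lists them instead. The function name broken_links is hypothetical; it relies only on find and test, and assumes a find that supports -maxdepth (GNU and BSD find both do).

```shell
# Hypothetical helper: print every symlink directly inside the given
# directory whose target does not exist. "test -e" follows the link,
# so it fails only for dangling links.
broken_links() {
  find "$1" -maxdepth 1 -type l ! -exec test -e {} \; -print
}
```

Running it against each of the folders touched above (spark/lib, hadoop-yarn/lib, and spark/examples/lib under the parcel directory) should print nothing if every link resolves.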
- Copy the Thrift server start and stop scripts (start-thriftserver.sh and stop-thriftserver.sh) from the dist/sbin folder to /opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/sbin/ on every node that you want to start the Thrift server from
- Start the Thrift server with the command
./sbin/start-thriftserver.sh --master yarn-client --executor-memory 512m --hiveconf hive.server2.thrift.port=10001
- If the JAVA_HOME environment variable is not set, you will need to set it before starting the server.
That should be it; you are all set to start playing with the latest version of Spark on your CDH. Please note that installing the latest version of Spark this way might put you out of compliance for Cloudera support if any bugs or issues arise. If Cloudera support is important to you, check with your account rep to understand the implications.
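A quick pre-flight check can catch a missing or wrong JAVA_HOME before the start script fails. This is only a sketch: check_java_home is a hypothetical helper, and it assumes that a usable JDK or JRE keeps its java executable under bin/ inside JAVA_HOME.

```shell
# Hypothetical helper: succeed only if JAVA_HOME is set and contains an
# executable bin/java; otherwise explain the problem on stderr and fail.
check_java_home() {
  if [ -z "${JAVA_HOME:-}" ]; then
    echo "JAVA_HOME is not set" >&2
    return 1
  fi
  if [ ! -x "$JAVA_HOME/bin/java" ]; then
    echo "No java executable under $JAVA_HOME/bin" >&2
    return 1
  fi
}
```

It can be chained in front of the start command, as in check_java_home && ./sbin/start-thriftserver.sh followed by the same options as above, so the server is only launched when the check passes.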