- Download the Spark source for the version you want to run on your CDH cluster. You can either clone the git repository or download the source for a specific release from the Apache Spark site. Here is how to get the source from git:
To get the latest code from master:
git clone git://github.com/apache/spark.git
To get a specific maintenance branch with stability fixes on top of Spark 1.5:
git clone git://github.com/apache/spark.git -b branch-1.5
- cd into the cloned directory: cd spark
- Execute the command below to build the Spark libraries from source:
./make-distribution.sh -Phadoop-2.6 -Pyarn -Dhadoop.version=2.6.0-cdh5.5.0 -DskipTests -Phive -Phive-thriftserver
- Spark must be built with the -Phive and -Phive-thriftserver options for the thrift server to be included in the distribution
- Maven 3.3 or higher is required, and Maven needs enough heap memory (1-2 GB) allotted via the -Xmx setting
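As a concrete sketch, the heap can be raised through MAVEN_OPTS before invoking the build; the values below follow the Spark 1.5 build documentation and may need adjusting for your machine:

```shell
# Check that Maven is 3.3+ before building (assumes mvn is on PATH):
#   mvn -version
# Give Maven enough heap for the Spark build; these values mirror the
# Spark 1.5 build docs and are a starting point, not a hard requirement.
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
```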
- The build takes around 15-30 minutes and places the libraries in the dist/ folder:
spark-examples-1.5.1-hadoop2.6.0-cdh5.5.0.jar
spark-assembly-1.5.1-hadoop2.6.0-cdh5.5.0.jar
spark-1.5.1-cdh5.5.0-yarn-shuffle.jar
- Stop the Spark service on your cluster.
- Copy the three jar files to all machines in your cluster. I copied them to /opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/jars
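One way to fan the jars out, sketched with a hypothetical hosts file (cluster-hosts.txt, one hostname per line) and plain scp; substitute your own distribution mechanism:

```shell
# Hypothetical helper: push the three rebuilt jars from dist/lib to the
# parcel jars directory on every node. cluster-hosts.txt is an assumed
# file, not something CDH provides.
push_jars() {
  local hosts_file=$1
  local jars_dir=/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/jars
  local host
  while read -r host; do
    scp dist/lib/spark-assembly-1.5.1-hadoop2.6.0-cdh5.5.0.jar \
        dist/lib/spark-examples-1.5.1-hadoop2.6.0-cdh5.5.0.jar \
        dist/lib/spark-1.5.1-cdh5.5.0-yarn-shuffle.jar \
        "$host:$jars_dir/"
  done < "$hosts_file"
}
# Usage: push_jars cluster-hosts.txt
```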
- On every node in the cluster, log in via shell and run the commands below
- Change directory to /opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/lib
cd /opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/lib
- update the spark-assembly jar
sudo rm spark-assembly.jar
sudo rm spark-assembly-1.5.0-cdh5.5.0-hadoop2.6.0-cdh5.5.0.jar
sudo ln -s ../../../jars/spark-assembly-1.5.1-hadoop2.6.0-cdh5.5.0.jar
sudo ln -s ../../../jars/spark-assembly-1.5.1-hadoop2.6.0-cdh5.5.0.jar spark-assembly.jar
- update the spark examples jar
sudo rm spark-examples-1.5.0-cdh5.5.0-hadoop2.6.0-cdh5.5.0.jar
sudo rm spark-examples.jar
sudo ln -s ../../../jars/spark-examples-1.5.1-hadoop2.6.0-cdh5.5.0.jar
sudo ln -s ../../../jars/spark-examples-1.5.1-hadoop2.6.0-cdh5.5.0.jar spark-examples.jar
- update the yarn-shuffle jar
cd /opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/hadoop-yarn/lib
sudo rm spark-1.5.0-cdh5.5.0-yarn-shuffle.jar
sudo rm spark-yarn-shuffle.jar
sudo ln -s ../../../jars/spark-1.5.1-cdh5.5.0-yarn-shuffle.jar
sudo ln -s ../../../jars/spark-1.5.1-cdh5.5.0-yarn-shuffle.jar spark-yarn-shuffle.jar
- update the examples symlink in the examples/lib folder
cd /opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/examples/lib
sudo rm spark-examples-1.5.0-cdh5.5.0-hadoop2.6.0-cdh5.5.0.jar
sudo ln -s ../../lib/spark-examples-1.5.1-hadoop2.6.0-cdh5.5.0.jar
- List the symlinks in each of these folders and make sure none of them are broken; many terminals show broken symlinks blinking or highlighted, which indicates the link points to a missing target
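A quick way to verify is to search for dangling symlinks; this sketch assumes GNU find, whose -xtype l matches symlinks whose target does not exist:

```shell
# List dangling symlinks under the given directories (GNU find).
check_links() {
  find "$@" -xtype l
}
# Usage, from the parcel's lib directory -- no output means every
# symlink resolves:
#   cd /opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib
#   check_links spark/lib hadoop-yarn/lib spark/examples/lib
```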
- Restart the Spark service from Cloudera Manager
- Copy the thrift server start and stop scripts from the dist/sbin folder to /opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/sbin/ on every node that you want to start the thrift server from
- Start the thrift server with the command:
./sbin/start-thriftserver.sh --master yarn-client --executor-memory 512m --hiveconf hive.server2.thrift.port=10001
- If the JAVA_HOME environment variable is not set, you will need to set it before starting the server.
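For example (the JDK path below is an assumption based on the JDK CDH 5.5 typically installs; point it at whatever JDK your nodes actually use). Once the server is up, beeline gives a quick smoke test:

```shell
# Assumed JDK location -- adjust to the JDK installed on the node.
export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
export PATH="$JAVA_HOME/bin:$PATH"
# Smoke test once the thrift server is running (run interactively):
#   beeline -u jdbc:hive2://localhost:10001
```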