Installing Livy on a Hadoop Cluster

Purpose

Livy is an open source REST service for Apache Spark that allows you to submit Spark jobs and interactive code snippets to your Apache Spark cluster through REST calls. You can view the source code here: https://github.com/cloudera/livy

In this post I will go over the steps needed to get Livy installed on a Hadoop cluster. The steps were derived from the source code link above; however, this post also covers a simpler way to test the installation.

Install Steps

  1. Determine which node in your cluster will act as the Livy server
    1. Note: this node will need the Hadoop and Spark libraries and configurations deployed on it.
  2. Log in to the machine as root
  3. Download the Livy source code
    cd /opt
    wget https://github.com/cloudera/livy/archive/v0.2.0.zip
    unzip v0.2.0.zip
    cd livy-0.2.0
  4. Get the version of Spark that is currently installed on your cluster
    1. Run the following command
      spark-submit --version
    2. Example: 1.6.0
    3. Use this value in downstream commands as {SPARK_VERSION}
  5. Build the Livy source code with Maven
    /usr/local/apache-maven/apache-maven-3.0.4/bin/mvn -DskipTests=true -Dspark.version={SPARK_VERSION} clean package
  6. You're done!
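
For convenience, the install steps above can be collected into a single script. This is a minimal sketch that reuses the paths from this post (source under /opt, the Maven install at /usr/local/apache-maven/apache-maven-3.0.4) and the 1.6.0 example Spark version; adjust both for your environment.

#!/bin/bash
# Consolidated sketch of the install steps above -- adjust paths and versions for your cluster
set -e

# Step 4: check "spark-submit --version" on your cluster and set the value here (1.6.0 is the example from this post)
SPARK_VERSION=1.6.0

# Step 3: download and unpack the Livy source
cd /opt
wget https://github.com/cloudera/livy/archive/v0.2.0.zip
unzip v0.2.0.zip
cd livy-0.2.0

# Step 5: build Livy against the installed Spark version
/usr/local/apache-maven/apache-maven-3.0.4/bin/mvn \
  -DskipTests=true -Dspark.version="${SPARK_VERSION}" clean package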

Steps to Control Livy

Get Status

ps -eaf | grep livy

If Livy is running, the process will be listed like the following:

root      9379     1 14 18:28 pts/0    00:00:01 java -cp /opt/livy-0.2.0/server/target/jars/*:/opt/livy-0.2.0/conf:/etc/hadoop/conf: com.cloudera.livy.server.LivyServer
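
Alternatively, you can check with pgrep against the LivyServer main class shown in the listing above (a minimal sketch; it prints the PID if the server is running and nothing otherwise):

# Prints the Livy server PID if it is running
pgrep -f com.cloudera.livy.server.LivyServer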

Start

Note: Run as root

cd /opt/livy-0.2.0/
export SPARK_HOME=/usr/lib/spark
export HADOOP_CONF_DIR=/etc/hadoop/conf
./bin/livy-server start

Once started, the Livy server can be reached at the following host and port:

http://localhost:8998

If you're calling it from another machine, replace "localhost" with the public IP or hostname of the Livy server.
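
If you want to confirm the server actually came up before moving on, the start commands above can be wrapped in a small script that polls the REST port until it responds. This is a sketch that assumes the install path, SPARK_HOME, and port used in this post:

#!/bin/bash
# Start Livy and wait until the REST endpoint responds (sketch; adjust paths as needed)
export SPARK_HOME=/usr/lib/spark
export HADOOP_CONF_DIR=/etc/hadoop/conf

cd /opt/livy-0.2.0/
./bin/livy-server start

# Poll the sessions endpoint for up to ~30 seconds
for i in $(seq 1 30); do
  if curl -s -o /dev/null http://localhost:8998/sessions; then
    echo "Livy is up"
    exit 0
  fi
  sleep 1
done
echo "Livy did not respond on port 8998" >&2
exit 1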

Stop

Note: Run as root

cd /opt/livy-0.2.0/
./bin/livy-server stop

Testing Livy

These examples assume you are running them from the machine where Livy was installed, which is why they use localhost. If you would like to test from another machine, simply change "localhost" to the public IP or hostname of the Livy server. To keep the commands easy to adapt, you can also put the base URL in a shell variable, as shown below.
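
A minimal sketch, assuming the default port of 8998; the numbered examples below spell out the full URL for clarity:

# Point this at the Livy server (hostname or IP); the examples below use localhost
LIVY_URL=http://localhost:8998
curl -H "Content-Type: application/json" -i "$LIVY_URL/sessions"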

  1. Create a new Livy Session
    1. Curl Command
      curl -H "Content-Type: application/json" -X POST -d '{"kind":"spark"}' -i http://localhost:8998/sessions
    2. Output
      HTTP/1.1 201 Created
      Date: Wed, 02 Nov 2016 22:38:13 GMT
      Content-Type: application/json; charset=UTF-8
      Location: /sessions/1
      Content-Length: 81
      Server: Jetty(9.2.16.v20160414)
      
      {"id":1,"owner":null,"proxyUser":null,"state":"starting","kind":"spark","log":[]}
  2. View Current Livy Sessions
    1. Curl Command
      curl -H "Content-Type: application/json" -i http://localhost:8998/sessions
    2. Output
      HTTP/1.1 200 OK
      Date: Tue, 08 Nov 2016 02:30:34 GMT
      Content-Type: application/json; charset=UTF-8
      Content-Length: 111
      Server: Jetty(9.2.16.v20160414)
      
      {"from":0,"total":1,"sessions":[{"id":0,"owner":null,"proxyUser":null,"state":"idle","kind":"spark","log":[]}]}
  3. Get Livy Session Info
    1. Curl Command
      curl -H "Content-Type: application/json" -i http://localhost:8998/sessions/0
    2. Output
      HTTP/1.1 200 OK
      Date: Tue, 08 Nov 2016 02:31:04 GMT
      Content-Type: application/json; charset=UTF-8
      Content-Length: 77
      Server: Jetty(9.2.16.v20160414)
      
      {"id":0,"owner":null,"proxyUser":null,"state":"idle","kind":"spark","log":[]}
  4. Submit job to Livy
    1. Curl Command
      curl -H "Content-Type: application/json" -X POST -d '{"code":"println(sc.parallelize(1 to 5).collect())"}' -i http://localhost:8998/sessions/0/statements
    2. Output
      HTTP/1.1 201 Created
      Date: Tue, 08 Nov 2016 02:31:29 GMT
      Content-Type: application/json; charset=UTF-8
      Location: /sessions/0/statements/0
      Content-Length: 40
      Server: Jetty(9.2.16.v20160414)
      
      {"id":0,"state":"running","output":null}
  5. Get Job Status and Output
    1. Curl Command
      curl -H "Content-Type: application/json" -i http://localhost:8998/sessions/0/statements/0
    2. Output
      HTTP/1.1 200 OK
      Date: Tue, 08 Nov 2016 02:32:15 GMT
      Content-Type: application/json; charset=UTF-8
      Content-Length: 109
      Server: Jetty(9.2.16.v20160414)
      
      {"id":0,"state":"available","output":{"status":"ok","execution_count":0,"data":{"text/plain":"[I@6270e14a"}}}
  6. Delete Session
    1. Curl Command
      curl -H "Content-Type: application/json" -X DELETE -i http://localhost:8998/sessions/0
    2. Output
      {"msg":"deleted"}
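
The testing steps above can also be driven end to end from a single script. This is a minimal sketch, assuming the default port and a fresh session that starts cleanly: it creates a session, waits for it to become idle, submits the same println statement, prints the statement result, and deletes the session. The grep-based JSON checks are just for illustration; a real client would parse the responses properly.

#!/bin/bash
# End-to-end Livy smoke test (sketch): create a session, run one statement, clean up
LIVY_URL=http://localhost:8998

# Create a Spark session and capture its URL from the Location header
LOCATION=$(curl -s -i -H "Content-Type: application/json" -X POST \
  -d '{"kind":"spark"}' "$LIVY_URL/sessions" | grep -i '^Location:' | awk '{print $2}' | tr -d '\r')
SESSION_URL="$LIVY_URL$LOCATION"
echo "Created session at $SESSION_URL"

# Wait until the session state is idle
until curl -s "$SESSION_URL" | grep -q '"state":"idle"'; do
  sleep 2
done

# Submit a statement; for a fresh session the first statement id is 0
curl -s -H "Content-Type: application/json" -X POST \
  -d '{"code":"println(sc.parallelize(1 to 5).collect())"}' \
  "$SESSION_URL/statements" > /dev/null
until curl -s "$SESSION_URL/statements/0" | grep -q '"state":"available"'; do
  sleep 2
done
curl -s "$SESSION_URL/statements/0"
echo

# Delete the session
curl -s -H "Content-Type: application/json" -X DELETE "$SESSION_URL"
echo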

Installing SparkR on a Hadoop Cluster

Purpose

SparkR is an extension to Apache Spark that allows you to run Spark jobs with the R programming language, which gives you access to R packages and libraries from within your Spark jobs. On both Cloudera and MapR distributions, SparkR is not supported out of the box and needs to be installed separately.

Installation Steps

Here are the steps you can take to Install SparkR on a Hadoop Cluster:

  1. Execute the following steps on all the Spark Gateways/Edge Nodes
    1. Login to the target machine as root
    2. Install R and other Dependencies
      1. Execute the following commands to install R
        1. Ubuntu
          sh -c 'echo "deb http://cran.rstudio.com/bin/linux/debian lenny-cran/" >> /etc/apt/sources.list'
          apt-get update
          apt-get install r-base r-base-dev
        2. CentOS
          1. Install the repo
            rpm -ivh http://mirror.unl.edu/epel/6/x86_64/epel-release-6-8.noarch.rpm
          2. Enable the repo
            1. Edit the /etc/yum.repos.d/epel-testing.repo file with your favorite text editing software
            2. Change each enabled=0 setting to enabled=1
          3. Clean yum cache
            yum clean all
          4. Install R and Dependencies
            yum install R R-devel libcurl-devel openssl-devel
      2. Test R installation
        1. Start up an R Session
          R
        2. Within the R shell, execute a simple addition to verify that commands run correctly
          1 + 1
        3. Quit when you’re done
          quit()
      3. Note: R libraries get installed at "/usr/lib64/R"
    3. Get the version of Spark you currently have installed
      1. Run the following command
        spark-submit --version
      2. Example output: 1.6.0
      3. Replace the Placeholder {SPARK_VERSION} with this value
    4. Install SparkR
      1. Start up the R console
        R
      2. Install the required R packages
        install.packages("devtools")
        install.packages("roxygen2")
        install.packages("testthat")
      3. Install the SparkR Packages
        devtools::install_github('apache/spark@v{SPARK_VERSION}', subdir='R/pkg')
        install.packages('sparklyr')
      4. Close out of the R shell
        quit()
    5. Find the Spark home directory and use it wherever the placeholder {SPARK_HOME_DIRECTORY} appears below
    6. Copy the SparkR sources from the Apache Spark distribution into your Spark home
      cd /tmp/
      wget https://github.com/apache/spark/archive/v{SPARK_VERSION}.zip
      unzip v{SPARK_VERSION}.zip
      cd spark-{SPARK_VERSION}
      cp -r R {SPARK_HOME_DIRECTORY}
      cd bin
      cp sparkR {SPARK_HOME_DIRECTORY}/bin/
    7. Run Dev Install
      cd {SPARK_HOME_DIRECTORY}/R/
      sh install-dev.sh
    8. Create a new file "/usr/bin/sparkR" and set the contents
      1. Copy the contents of the /usr/bin/spark-shell file to /usr/bin/sparkR
        cp /usr/bin/spark-shell /usr/bin/sparkR
      2. Edit the /usr/bin/sparkR file and replace "spark-shell" with "sparkR" in the exec command at the bottom of the file (see the sketch after this list).
    9. Finish install
      sudo chmod 755 /usr/bin/sparkR
    10. Verify that the sparkR command is available
      cd ~
      which sparkR
    11. You're done!
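
As a reference for step 8, here is roughly what the finished /usr/bin/sparkR wrapper ends up looking like. This is only a sketch: the exact contents depend on what your distribution ships in /usr/bin/spark-shell, so only the final exec line is shown as it would read after the edit.

#!/bin/bash
# Sketch of /usr/bin/sparkR after copying /usr/bin/spark-shell and editing the exec line.
# Your copy of the file may export additional environment variables; keep those as-is.

# {SPARK_HOME_DIRECTORY} is the Spark home found earlier (for example /usr/lib/spark)
exec {SPARK_HOME_DIRECTORY}/bin/sparkR "$@"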

Testing

Upon completion of the installation steps, here are some ways that you can test the installation to verify everything is running correctly.

  • Test from R Console – Run on a Spark Gateway
    1. Start an R Shell
      R
    2. Execute the following commands in the R Shell
      library(SparkR)
      library(sparklyr)
      Sys.setenv(SPARK_HOME='{SPARK_HOME_DIRECTORY}')
      Sys.setenv(SPARK_HOME_VERSION='{SPARK_VERSION}')
      Sys.setenv(YARN_CONF_DIR='{YARN_CONF_DIRECTORY}')
      sc = spark_connect(master = "yarn-client")
    3. If this runs without errors then you know it’s working!
  • Test from SparkR Console – Run on a Spark Gateway
    1. Open the SparkR Console
      sparkR
    2. Verify the Spark Context is available with the following command:
      sc
    3. If the sc variable is listed then you know it’s working!
  • Sample code you can run to test more
    rdd = SparkR:::parallelize(sc, 1:5)
    SparkR:::collect(rdd)
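
If you want to run the R console check non-interactively (for example as part of a post-install script), the same commands can be fed to R through a heredoc. A minimal sketch, assuming the placeholders from the steps above have been filled in; spark_disconnect is used here just to close the connection cleanly.

#!/bin/bash
# Non-interactive SparkR/sparklyr smoke test (sketch) -- run on a Spark gateway node
R --no-save <<'EOF'
library(SparkR)
library(sparklyr)
Sys.setenv(SPARK_HOME='{SPARK_HOME_DIRECTORY}')
Sys.setenv(SPARK_HOME_VERSION='{SPARK_VERSION}')
Sys.setenv(YARN_CONF_DIR='{YARN_CONF_DIRECTORY}')
sc = spark_connect(master = "yarn-client")
print("sparklyr connected to YARN")
spark_disconnect(sc)
EOF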