
Installing SparkR on a Hadoop Cluster


SparkR is an extension to Apache Spark that lets you run Spark jobs from the R programming language, which means you can use R packages and libraries in your Spark jobs. On both Cloudera and MapR distributions, SparkR is not supported out of the box and must be installed separately.

Installation Steps

Here are the steps you can take to install SparkR on a Hadoop cluster:

  1. Execute the following steps on all the Spark Gateways/Edge Nodes
    1. Log in to the target machine as root
    2. Install R and other Dependencies
      1. Execute the following to Install
        1. Ubuntu
          sh -c 'echo "deb lenny-cran/" >> /etc/apt/sources.list'
          apt-get update
          apt-get install r-base r-base-dev
        2. Centos
          1. Install the repo
            rpm -ivh
          2. Enable the repo
            1. Edit the /etc/yum.repos.d/epel-testing.repo file with your favorite text editing software
            2. Set enabled=1 in each section of the file
          3. Clean yum cache
            yum clean all
          4. Install R and Dependencies
            yum install R R-devel libcurl-devel openssl-devel
      2. Test R installation
        1. Start up an R Session
        2. Within the R shell, run a simple addition to confirm that commands execute correctly
          1 + 1
        3. Quit with q() when you’re done
      3. Note: R libraries get installed at “/usr/lib64/R”
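The “enable the repo” edit in the CentOS path above can be scripted instead of done by hand. A minimal sketch, demonstrated on a stand-in copy of the file (the real target is /etc/yum.repos.d/epel-testing.repo):

```shell
# Flip enabled=0 to enabled=1 in every section. Demonstrated on a stand-in
# copy; point sed at /etc/yum.repos.d/epel-testing.repo on a real node.
printf '[epel-testing]\nname=EPEL testing\nenabled=0\n' > /tmp/epel-testing.repo
sed -i 's/^enabled=0/enabled=1/' /tmp/epel-testing.repo
grep enabled /tmp/epel-testing.repo
```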
    3. Get the version of Spark you currently have installed
      1. Run the following command
        spark-submit --version
      2. Example output: 1.6.0
      3. Replace the Placeholder {SPARK_VERSION} with this value
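The placeholder substitution in step 3 can also be done with sed rather than by hand. A sketch, with the version hard-coded to the example value above (on a real node you would capture it from spark-submit; note that some versions print the banner to stderr):

```shell
# Substitute the {SPARK_VERSION} placeholder used in the later commands.
# On a real node: SPARK_VERSION=$(spark-submit --version 2>&1 | grep -o '[0-9][0-9.]*' | head -1)
SPARK_VERSION="1.6.0"   # example value from the output above
echo "apache/spark@v{SPARK_VERSION}" | sed "s/{SPARK_VERSION}/$SPARK_VERSION/"
```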
    4. Install SparkR
      1. Start up the R console
      2. Install the R packages that the build depends on (the install_github call in the next step requires the devtools package)
      3. Install the SparkR Packages
        devtools::install_github('apache/spark@v{SPARK_VERSION}', subdir='R/pkg')
      4. Close out of the R shell
    5. Find the Spark Home Directory and replace the Placeholder {SPARK_HOME_DIRECTORY} with this value
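One way to find the Spark home directory is to walk two levels up from the spark-submit binary. A sketch using a stand-in directory layout under /tmp (on a real node, set SPARK_SUBMIT from the output of `which spark-submit`):

```shell
# Derive the Spark home directory from the spark-submit path.
# Stand-in layout for demonstration; on a real node use $(which spark-submit).
mkdir -p /tmp/spark-home/bin
touch /tmp/spark-home/bin/spark-submit
SPARK_SUBMIT=/tmp/spark-home/bin/spark-submit
SPARK_HOME_DIRECTORY=$(dirname "$(dirname "$(readlink -f "$SPARK_SUBMIT")")")
echo "$SPARK_HOME_DIRECTORY"
```

The `readlink -f` matters on real clusters, where spark-submit is often a symlink into the actual install directory.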
    6. Install the SparkR OS Dependencies
      cd /tmp/
      unzip v{SPARK_VERSION}.zip
      cd spark-{SPARK_VERSION}
      cd bin
      cp sparkR {SPARK_HOME_DIRECTORY}/bin/
    7. Run Dev Install
    8. Create a new file “/usr/bin/sparkR” and set the contents
      1. Copy the contents of the /usr/bin/spark-shell file to /usr/bin/sparkR
        cp /usr/bin/spark-shell /usr/bin/sparkR
      2. Edit the /usr/bin/sparkR file. Replace “spark-shell” with “sparkR” on the bottom exec command.
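Step 8’s edit can be scripted with sed as well. A sketch on a stand-in wrapper file (the exec path below is a made-up example, not the real contents of /usr/bin/spark-shell):

```shell
# Rewrite the wrapper's exec line so it launches sparkR instead of
# spark-shell. Stand-in file with a hypothetical exec path; on a real node,
# copy /usr/bin/spark-shell to /usr/bin/sparkR first.
printf '#!/bin/bash\nexec /opt/spark/bin/spark-shell "$@"\n' > /tmp/sparkR
sed -i 's/spark-shell/sparkR/' /tmp/sparkR
tail -1 /tmp/sparkR
```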
    9. Finish install
      sudo chmod 755 /usr/bin/sparkR
    10. Verify that the sparkR command is available
      cd ~
      which sparkR
    11. You’re done!


Upon completion of the installation steps, here are some ways that you can test the installation to verify everything is running correctly.

  • Test from R Console – Run on a Spark Gateway
    1. Start an R Shell
    2. Execute the following commands in the R shell (note: spark_connect comes from the sparklyr package, not SparkR, so sparklyr must be installed first)
      library(sparklyr)
      sc = spark_connect(master = "yarn-client")
    3. If this runs without errors then you know it’s working!
  • Test from SparkR Console – Run on a Spark Gateway
    1. Open the SparkR Console
    2. Verify the Spark Context is available with the following command:
      ls()
    3. If the sc variable is listed then you know it’s working!
  • Sample code you can run to test more
    rdd = SparkR:::parallelize(sc, 1:5)
