Continuous Delivery With GoCD

This blog outlines our experience moving one of our projects to a Continuous Delivery model using GoCD on AWS.

Prior to this implementation, our code deployments were manual and on demand. We were looking for an automated way of deploying code to various environments with minimal manual intervention. GoCD has continuous delivery as a first-class concept and provides an intuitive interface to start building CD pipelines. We started off with a quick PoC to validate some of our understanding and, after initial success, we now use GoCD to define all of our deployment/delivery pipelines.

This move forced us to define a comprehensive test suite and a workflow that sets the criteria for promoting code across environments. The change also increased our ability to push smaller changes more frequently.

Deployment vs Delivery

It is fairly common to see the terms Continuous Delivery and Continuous Deployment used interchangeably. For some this is a huge distinction, and for others it does not matter.

Continuous Deployment provides the ability to automatically release new features and changes to production as soon as the code is checked in. This typically means there are no process-related gating functions between code being checked in and a release of that software making it to production. The only gating function (a simplification) is whether or not the automated test suite has passed. Any code defect will lead to test failures, which force the deployment to fail and stop, so it is important to write integration tests with maximum scenario coverage in order to move towards continuous deployment.

Continuous Delivery, while similar to Continuous Deployment, differs in one major aspect: the automation goes as far as the process within an organization allows, and then relies on human or other approval processes to deploy to production. Continuous integration and continuous delivery are prerequisites for continuous deployment. GoCD is a tool which gives us the ability to create pipelines to accomplish continuous delivery.

Now that we have that squared away, let’s look at GoCD and how we can use it in a bit more depth.

Key GoCD Concepts

  • Environment
  • Pipeline
  • Stage
  • Job
  • Task
  • Go Server
  • Go Agent

Each artifact (service) being deployed to various environments in the form of pipelines can form a pipeline group. A pipeline within a pipeline group deploys the artifact to an environment (like DEV or QA). A pipeline consists of various stages; each stage consists of jobs which execute in parallel; and lastly, each job consists of tasks which execute sequentially. Pipeline definitions can be shared among multiple artifact pipelines using a pipeline template. For example, the QA pipelines of two different applications, ServiceA and ServiceB, can share the same pipeline template. Environment variables and other properties can be shared between pipelines belonging to the same environment in the form of GoCD "environments". For instance, environment variables created in the GoCD DEV environment are available to all GoCD DEV pipelines.

Go-Server and Go-Agent are the entities which together provide the ability to define and run pipelines. The Go-Server allows us to create pipelines and maintains the configuration and any additional data that composes our pipelines. Go-Agents take commands from the Go-Server and execute the stages of a pipeline.

Sample Pipeline Workflow

The diagram below depicts a sample workflow created using GoCD. The source control system used is Subversion and the deployment is on AWS, but other setups are similar and instructions can be found on the https://gocd.io website. This workflow shows deployment through two environments: DEV and QA. Once code gets checked in, it is picked up by the Build stage of the DEV pipeline, which polls for changes on trunk. Four stages are set up for the pipeline in each environment. In DEV they follow the order of build, deployment, integration testing, and build promotion for the next environment. For the QA environment, the artifact promoted in DEV is used as the material (input to the pipeline).

GoCD Server and Agent Installation on AWS

On the AWS server, go to the location where you want to download the files and run these commands in order:

$ wget https://download.gocd.io/binaries/16.12.0-4352/rpm/go-server-16.12.0-4352.noarch.rpm
$ sudo rpm -i go-server-16.12.0-4352.noarch.rpm
$ wget https://download.gocd.io/binaries/16.12.0-4352/rpm/go-agent-16.12.0-4352.noarch.rpm
$ sudo rpm -i go-agent-16.12.0-4352.noarch.rpm

For simplicity, we are installing go-server and go-agent on the same machine, but they can live on different servers. Once installed, the go-server can be accessed at https://{go-server-ip}:8154/go/

GoCD provides a set of commands, located in the init.d folder, to start or stop the server and agents. Go-agent and go-server can be controlled using these commands:

$ /etc/init.d/go-agent [start|stop|status|restart]
$ /etc/init.d/go-server [start|stop|status|restart]

By default, the go processes run as the 'go' user. After startup, the server can be accessed using these URLs:

https://{go-server-ip}:8154/go [secured]
http://{go-server-ip}:8183/go   [unsecured]

If the server is running, you will see a default pipelines page. The go-agent is also registered and can be found at this URL: https://{go-server-ip}:8154/go/agents
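A quick sanity check from the shell looks like this; the -k flag skips certificate verification, since the default certificate on the HTTPS port is typically self-signed:

$ curl -k https://{go-server-ip}:8154/go/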

If you want to have separate agents addressing different environments, check the agent you want to modify, click environments and make changes.

Pipeline Setup

Access pipelines admin page : {go-server}:8154/go/admin/pipelines

  • Create a new pipeline group, then add a pipeline within this pipeline group.
  • Add material information. For the DEV pipeline, this should be one or more SVN (or other version-controlled) materials. For subsequent pipelines, it should be a pipeline stage from the parent pipeline.

  • Add stages – Build, Deploy, Test, Promote (or others, based on requirements)
  • Within stages, add jobs, and then add tasks within each job.
  • A task can be one of the primary task types from the dropdown list, or it can execute shell scripts using the "Script Executor" plugin (a sketch of such a script appears after this list).

  • From the home page, click environments and add a new Pipeline Environment if none exists using “Add A New Environment“.

  • Make sure the Pipeline Environment has an agent associated with it in order for any pipeline within it to run. Agents can be added via the 'Add Agents' tab when you create the environment, or can be managed later using the 'Agents' tab, as shown above.
  • Once the environment is created, go to the 'Environments' tab from the home page and click on the environment created. It appears like this:

  • An environment can hold environment variables that are shared by all pipelines within it. A pipeline can also have its own environment variables.
  • If the pipeline is to be an automatic one, make sure to check the ' ' box on the pipeline edit page. Leave it unchecked for manual deploys, and add a cron expression for time-scheduled deploys. For example, to run a pipeline every day at 8 am, put this in the pipeline's timer settings: 0 0 8 * * ? More documentation on using cron timers is here: https://docs.gocd.io/current/configuration/configuration_reference.html#timer
  • A new pipeline is in a paused state by default. Once the pipeline is created, go to the Pipelines tab and click the 'pause' button to un-pause it.
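As an illustration, a task run through the Script Executor plugin is just a shell script. Below is a minimal sketch of a deploy task; the bucket, artifact, and deploy command are hypothetical, while GO_PIPELINE_LABEL is a standard environment variable that GoCD exposes to jobs:

#!/bin/bash
# Hypothetical deploy task run by a "Script Executor" task.
# Bucket name, artifact name, and deploy script are placeholders.
set -e
echo "Deploying build ${GO_PIPELINE_LABEL} of ServiceA"
aws s3 cp target/service-a.zip s3://example-artifact-bucket/service-a/${GO_PIPELINE_LABEL}.zip
./scripts/deploy.sh dev s3://example-artifact-bucket/service-a/${GO_PIPELINE_LABEL}.zip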

Extracting Templates from existing pipelines and Creating pipelines from Template

  • If a pipeline's stages are to be reused, go to the admin/pipelines page and click 'Extract Template' next to the pipeline whose template you want to extract.

  • To use an extracted template, click "Use Template" when you create a new pipeline, instead of creating stages from scratch.

Plugin Installation

The Script Executor plugin used above can be downloaded from here. GoCD provides plenty of other plugins as well.

To install the above plugin, download only the jar from its releases and place it in <go-server-location>/plugins/external, then restart the Go Server.
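With the default package install (Go Server location /var/lib/go-server, as listed in the directories section below), the installation amounts to something like this, assuming the plugin jar has already been downloaded into the current directory (the jar filename pattern is an assumption):

$ sudo cp script-executor-*.jar /var/lib/go-server/plugins/external/
$ sudo /etc/init.d/go-server restart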

GoCD Directories

Listed below are the default locations.

Go agent/server config:
/etc/default/go-agent
/etc/default/go-server
Log path:
/var/log/go-agent
Go Server location:
/var/lib/go-server/
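When something misbehaves, tailing the logs is usually the first step. These paths assume the default package install; the go-server log directory follows the same convention as the go-agent one listed above:

$ tail -f /var/log/go-agent/go-agent.log
$ tail -f /var/log/go-server/go-server.log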

Adding a new user for GoCD server access

There are multiple mechanisms to set up authentication with GoCD. We used file-based authentication.

Steps to set up file based authentication:

  1. Create a ‘.passwd’ file on the GoCD server, say /users/go/.passwd
  2. Add users to this file from the command line like this:
    $ htpasswd -s .passwd username

    The user is prompted for the password. The -s flag forces htpasswd to hash the password with SHA-1, which is the format GoCD’s password file authentication expects. If the file does not exist, htpasswd will do nothing except return an error (see the example after these steps for creating it).

  3. Go to Admin page, and click on Server Configuration tab.
  4. Under the User Management section, find ‘Password File Settings’, enter the path /users/go/.passwd, then click Save.
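If the password file does not exist yet, the -c flag creates it on the first run (same example path and username as above):

$ htpasswd -c -s /users/go/.passwd username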

Managing Disk Space

GoCD needs at least 1 GB of free disk space; below that, it will start complaining about disk space issues. There are various ways to free up disk space. Below is what we follow:

  • Make sure ‘Clean Working Directory’ is checked in all stages (see pipeline -> stage -> ‘Stage Settings’).
  • Delete the pipelines folder from /var/lib/go-agent; it is the temporary working directory in which the go-agent keeps pipeline data (see the commands after this list).
  • Compress old logs.
  • Move the go-server artifacts to a location with more disk space (like a different mount point) and update ‘artifactsDir’ in Admin→Config XML to point to the new location.
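The middle two items boil down to a few commands on the agent box. This is only a sketch using the default paths listed earlier; stop the agent first so no pipeline is running while its working directory is removed:

$ sudo /etc/init.d/go-agent stop
$ sudo rm -rf /var/lib/go-agent/pipelines/*
$ sudo gzip /var/log/go-agent/*.log
$ sudo /etc/init.d/go-agent start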

Best Practices for Adopting CD

  • Do trunk-based deployments, and promote the same binaries across all environments.
  • When starting a new service or project, develop on a dev branch or feature branch for the group. Merge with trunk when ready, but use trunk for any new deployments after that.
  • Merge your branch to trunk frequently, backed by a set of thorough tests. Update the dev branch often from trunk during development; this reduces the time spent merging back.
  • Add tests (unit, integration, smoke, regression, etc.); without tests, CD will always put us at risk of promoting builds with bugs.
  • Use manual pipelines/intervention only in rare scenarios. Embrace CD as often as possible.

Understanding Resource Allocation configurations for a Spark application

Resource allocation is an important aspect of the execution of any Spark job. If not configured correctly, a Spark job can consume the entire cluster's resources and starve other applications.

This blog helps in understanding the basic flow in a Spark application, and then how to configure the number of executors, the memory settings for each executor, and the number of cores for a Spark job. There are a few factors we need to consider to decide the optimum numbers for these three, such as:

  • The amount of data
  • The time in which a job has to complete
  • Static or dynamic allocation of resources
  • Upstream or downstream application

 

Introduction

 

Let’s start with some basic definitions of the terms used in handling Spark applications.

Partitions: A partition is a small chunk of a large distributed data set. Spark manages data using partitions, which helps parallelize data processing with minimal data shuffling across the executors.

Task: A task is a unit of work that runs on a partition of a distributed dataset and gets executed on a single executor. The unit of parallel execution is at the task level. All the tasks within a single stage can be executed in parallel.

Executor: An executor is a single JVM process which is launched for an application on a worker node. The executor runs tasks and keeps data in memory or disk storage across them. Each application has its own executors. A single node can run multiple executors, and the executors for an application can span multiple worker nodes. An executor stays up for the duration of the Spark application and runs its tasks in multiple threads. The number of executors for a Spark application can be specified inside the SparkConf or via the flag --num-executors on the command line.
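For example, the flag appears in a spark-submit invocation like the sketch below, where the class name and jar path are placeholders:

$ spark-submit --master yarn --num-executors 4 --class com.example.MyJob my-job.jar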

Cluster Manager: An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN). Spark is agnostic to the cluster manager as long as it can acquire executor processes and those can communicate with each other. We are primarily interested in YARN as the cluster manager. A Spark application can run in either yarn-cluster or yarn-client mode:

yarn-client mode – the driver runs in the client process, and the Application Master is used only for requesting resources from YARN.

yarn-cluster mode – the driver runs inside the Application Master process, and the client goes away once the application is initialized.

Cores: A core is a basic computation unit of a CPU, and a CPU may have one or more cores to perform tasks at a given time. The more cores we have, the more work we can do. In Spark, this controls the number of parallel tasks an executor can run.

 

 

Steps involved in cluster mode for a Spark Job

  1. From the driver code, SparkContext connects to the cluster manager (standalone/Mesos/YARN).
  2. The cluster manager allocates resources across applications. Any cluster manager can be used as long as the executor processes are running and can communicate with each other.
  3. Spark acquires executors on nodes in the cluster; each application gets its own executor processes.
  4. Application code (jar/Python files/Python egg files) is sent to the executors.
  5. Tasks are sent by the SparkContext to the executors.

 

From the above steps, it is clear that the number of executors and their memory settings play a major role in a Spark job. Running executors with too much memory often results in excessive garbage collection delays.

Now let's try to understand how to configure the best set of values to optimize a Spark job.

There are two ways in which we can configure the executor and core details for a Spark job. They are:

  1. Static Allocation – The values are given as part of spark-submit
  2. Dynamic Allocation – The values are picked up based on the requirement (size of data, amount of computations needed) and released after use. This helps the resources to be re-used for other applications.

 

Static Allocation

 

Different cases are discussed below, varying the parameters and arriving at different combinations as per user/data requirements.

 

Case 1: Hardware – 6 nodes, each node with 16 cores and 64 GB RAM

On each node, 1 core and 1 GB RAM are needed for the operating system and Hadoop daemons, which leaves 15 cores and 63 GB RAM per node.

We start with how to choose the number of cores:

Number of cores = Concurrent tasks an executor can run

So we might think that more concurrent tasks per executor will give better performance. But experience shows that any application running more than 5 concurrent tasks per executor tends to perform worse (HDFS client throughput, in particular, suffers with many concurrent threads). So the optimal value is 5.

This number comes from an executor's ability to run parallel tasks, not from how many cores a system has. So the number 5 stays the same even if we have double the cores (32) in the CPU.

Number of executors:

Coming to the next step: with 5 cores per executor and 15 available cores per node, we get 3 executors per node (15/5). We calculate the number of executors on each node and then the total for the job.

So with 6 nodes and 3 executors per node, we get a total of 18 executors. Out of those 18, we need 1 executor (Java process) for the Application Master in YARN, so the final number is 17 executors.

This 17 is the number we give to Spark using --num-executors when running from the spark-submit shell command.

Memory for each executor:

From the above step, we have 3 executors per node, and the available RAM on each node is 63 GB.

So the memory for each executor on each node is 63/3 = 21 GB.

However, a small amount of overhead memory also needs to be accounted for when determining the full memory request to YARN for each executor.

The formula for that overhead is max(384 MB, 0.07 * spark.executor.memory)

Calculating that overhead: 0.07 * 21 GB (the 63/3 figure from above) = 1.47 GB

Since 1.47 GB > 384 MB, the overhead is 1.47 GB.

Subtracting that overhead from each executor's 21 GB: 21 – 1.47 ≈ 19 GB

So executor memory – 19 GB

Final numbers – Executors – 17, Cores 5, Executor Memory – 19 GB
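These numbers translate directly into spark-submit flags. A sketch of the resulting invocation, where the class name and jar path are placeholders:

$ spark-submit --master yarn --deploy-mode cluster \
    --num-executors 17 --executor-cores 5 --executor-memory 19G \
    --class com.example.MyJob my-job.jar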

 

Case 2: Hardware – 6 nodes, each node with 32 cores and 64 GB RAM

 

The number of cores per executor stays at 5 for good concurrency, as explained above.

Number of executors for each node = 32/5 ~ 6

So total executors = 6 per node * 6 nodes = 36. The final number is 36 – 1 (for the AM) = 35.

Executor memory:

6 executors per node, so 63/6 ≈ 10 GB. Overhead is 0.07 * 10 GB = 0.7 GB. Rounding the overhead up to 1 GB, we get 10 – 1 = 9 GB.

Final numbers – Executors – 35, Cores 5, Executor Memory – 9 GB

 

Case 3 – When the executors do not need that much memory

 

The above scenarios start by fixing the number of cores per executor and then deriving the number of executors and the memory.

Now for the first case, if we think we do not need 19 GB and just 10 GB is sufficient based on the data size and the computations involved, then the numbers work out as follows:

Cores: 5

Number of executors for each node = 3. Still 15/5 as calculated above.

At this stage, the earlier calculation would lead to 21 GB, and then 19 GB after overhead. But since we decided 10 GB is enough (assuming a small overhead), we cannot simply switch to 6 executors per node (as 63/10 would suggest): with 6 executors per node and 5 cores each, that would require 30 cores per node, and we only have 16. So we also need to change the number of cores per executor.

So calculating again,

The magic number 5 comes down to 3 (any number less than or equal to 5 works). With 3 cores per executor and 15 available cores per node, we get 5 executors per node, 29 executors in total (5 * 6 – 1), and memory of 63/5 ≈ 12 GB.

Overhead is 12 * 0.07 = 0.84 GB. Rounding to 1 GB, executor memory is 12 – 1 = 11 GB.

Final Numbers are 29 executors, 3 cores, executor memory is 11 GB

 

Summary Table

 

Case                                 Executors   Cores per executor   Executor memory
Case 1 (16 cores, 64 GB per node)    17          5                    19 GB
Case 2 (32 cores, 64 GB per node)    35          5                    9 GB
Case 3 (less memory per executor)    29          3                    11 GB

 

Dynamic Allocation

 

Note: if dynamic allocation is enabled, the default upper bound for the number of executors (spark.dynamicAllocation.maxExecutors) is infinity. This means a Spark application can eat away all the cluster's resources if needed. In a cluster where other applications are running and also need cores for their tasks, we need to make sure that cores are assigned at the cluster level.

 

This means we can allocate a specific number of cores for YARN-based applications based on user access. For example, we can create a spark_user and then assign min/max cores for that user. These limits are used for sharing resources between Spark and the other applications which run on YARN.

To understand dynamic allocation, we need to have knowledge of the following properties:

spark.dynamicAllocation.enabled – when this is set to true, we need not specify the number of executors. The reason is as follows:

The static numbers we give at spark-submit apply for the entire job duration. With dynamic allocation, however, there are different stages, described below:

How many executors to start with:

The initial number of executors (spark.dynamicAllocation.initialExecutors) to start with.

Controlling the number of executors dynamically:

Then, based on load (pending tasks), Spark decides how many executors to request. This would eventually be the number we would otherwise give at spark-submit in the static way. Once the initial executor count is set, Spark scales it between the min (spark.dynamicAllocation.minExecutors) and max (spark.dynamicAllocation.maxExecutors) values.

When to request new executors or give away current executors:

New executors are requested when there have been pending tasks for spark.dynamicAllocation.schedulerBacklogTimeout seconds. The number of executors requested in each round increases exponentially from the previous round: an application adds 1 executor in the first round, then 2, 4, 8, and so on in subsequent rounds, until the max bound above comes into play.

An executor is given away after it has been idle for spark.dynamicAllocation.executorIdleTimeout seconds.
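Putting these properties together, a dynamic-allocation submission might look like the sketch below. The bounds and timeout are example values only, the class and jar are placeholders, and dynamic allocation on YARN also requires the external shuffle service (spark.shuffle.service.enabled) to be turned on:

$ spark-submit --master yarn --deploy-mode cluster \
    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.shuffle.service.enabled=true \
    --conf spark.dynamicAllocation.initialExecutors=2 \
    --conf spark.dynamicAllocation.minExecutors=2 \
    --conf spark.dynamicAllocation.maxExecutors=20 \
    --conf spark.dynamicAllocation.executorIdleTimeout=60s \
    --executor-cores 5 --executor-memory 9G \
    --class com.example.MyJob my-job.jar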

To conclude, if we need more control over job execution time and want to monitor a job for unexpected data volumes, the static numbers help. With dynamic allocation, resources are managed in the background, and jobs involving unexpected data volumes might affect other applications.

 

References:

http://spark.apache.org/docs/latest/configuration.html#dynamic-allocation

http://spark.apache.org/docs/latest/job-scheduling.html#resource-allocation-policy

https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

http://spark.apache.org/docs/latest/cluster-overview.html