Airflow

Setting up an Apache Airflow Cluster

In one of our previous blog posts, we described the process you should follow when Installing and Configuring Apache Airflow. In this post, we will describe how to set up an Apache Airflow Cluster to run across multiple nodes. This will provide you with more computing power and higher availability for your Apache Airflow instance.

Airflow Daemons

A running instance of Airflow has a number of daemons that work together to provide the full functionality of Airflow. These daemons include the Web Server, Scheduler, Worker, Kerberos Ticket Renewer, Flower, and others. Below are the primary ones you will need to have running for a production-quality Apache Airflow Cluster.

Web Server

A daemon that accepts HTTP requests and allows you to interact with Airflow via a Python Flask web application. It provides the ability to pause and unpause DAGs, manually trigger DAGs, view running DAGs, restart failed DAGs, and much more. The Web Server Daemon starts up…
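As a rough illustration of what the Web Server lets you manage, here is a minimal sketch of a DAG that could be paused, unpaused, or manually triggered from the web UI. The dag_id, task_id, schedule, and start date are placeholders, and the import path assumes an Airflow 1.x-style installation.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.x import path

# Hypothetical dag_id and schedule, purely for illustration.
dag = DAG(
    dag_id="hello_cluster",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
)

# A single task that echoes a message; it shows up in the Web Server UI
# where it can be triggered, monitored, and restarted if it fails.
say_hello = BashOperator(
    task_id="say_hello",
    bash_command="echo 'hello from the Airflow cluster'",
    dag=dag,
)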

Big Data

Understanding Resource Allocation configurations for a Spark application

Resource Allocation is an important aspect of executing any Spark job. If not configured correctly, a Spark job can consume the entire cluster's resources and starve other applications of resources. This blog helps you understand the basic flow in a Spark application and then how to configure the number of executors, the memory settings of each executor, and the number of cores for a Spark job. There are a few factors we need to consider when deciding the optimum numbers for these three, such as:

The amount of data
The time in which a job has to complete
Static or dynamic allocation of resources
Upstream or downstream applications

Introduction

Let's start with some basic definitions of the terms used in handling Spark applications.

Partitions: A partition is a small chunk of a large distributed data set. Spark manages data using partitions, which helps parallelize data processing with minimal data…
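As a minimal sketch of where these three settings are applied, the PySpark snippet below sets the number of executors, the memory per executor, and the cores per executor when building a SparkSession. The application name and all numeric values are placeholder assumptions, not recommendations, and spark.executor.instances only takes effect when dynamic allocation is disabled.

from pyspark.sql import SparkSession

# Hypothetical values -- tune these to your cluster size and workload.
spark = (
    SparkSession.builder
    .appName("resource_allocation_example")
    .config("spark.executor.instances", "10")  # number of executors (static allocation)
    .config("spark.executor.memory", "4g")     # memory per executor
    .config("spark.executor.cores", "2")       # cores per executor
    .getOrCreate()
)

# Partitions are the unit of parallelism each executor core works on.
df = spark.range(1000000)
print(df.rdd.getNumPartitions())

spark.stop()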