Fixing an AWS EC2 Instance Boot Up Issue

Background

We recently had a problem with one of our AWS EC2 instances after shutting it down, making some configuration changes and starting it back up. We were unable to SSH onto the machine even though it appeared to come up OK (we kept getting a Connection Refused error). We reviewed the Security Group settings and network settings, reverted our configuration changes, made sure we were pointing to the correct IP address and much more, but we still couldn't SSH onto the machine.

Upon viewing the system logs, we noticed that one of the disk volumes had failed to mount during boot. It was an Instance Store drive that had apparently come back under a different device name after the restart. This prevented boot-up from completing, which in turn prevented the sshd daemon from starting, so we couldn't SSH onto the machine to effect repairs and were left dead in the water. We eventually figured out a way to view the file system and make the changes necessary to fix the issue, which is described in this blog post.
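
You can view an instance's system log from the EC2 console or, if you prefer, with the AWS CLI; the instance ID here is a placeholder:

aws ec2 get-console-output --instance-id i-0aaaaaaaaaaaaaaaa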

In our case it was an issue with /etc/fstab that forced us to follow these steps, but there are other cases where they can help as well. For example, if you mistakenly configured sshd not to start on startup of the machine, or if something else failed during boot-up and prevented the sshd daemon from starting.

High Level Process

To resolve this, we're going to detach the bad machine's root volume, attach and mount it on a healthy machine so we can explore the file system and fix the issue, and then reattach it to the original instance. The equivalent AWS CLI flow is sketched below.
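
If you prefer the command line to the console steps that follow, the same flow can be sketched with the AWS CLI. The instance IDs below are hypothetical placeholders; the volume ID matches the example used in the steps.

# stop the broken instance and detach its root volume
aws ec2 stop-instances --instance-ids i-0aaaaaaaaaaaaaaaa
aws ec2 detach-volume --volume-id vol-0c7bf2325c6ab485b

# attach the volume to a healthy instance as a secondary device
aws ec2 attach-volume --volume-id vol-0c7bf2325c6ab485b --instance-id i-0bbbbbbbbbbbbbbbb --device /dev/sdf

# ...SSH onto the healthy instance, mount the volume, fix the file system, unmount...

# move the volume back and start the original instance
aws ec2 detach-volume --volume-id vol-0c7bf2325c6ab485b
aws ec2 attach-volume --volume-id vol-0c7bf2325c6ab485b --instance-id i-0aaaaaaaaaaaaaaaa --device /dev/xvda
aws ec2 start-instances --instance-ids i-0aaaaaaaaaaaaaaaa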

Step by Step Process

Setup

Suppose we have an EC2 instance (call it prod-instance) which has booted up OK but which we're unable to SSH onto.


Steps

  1. Log in to the AWS Web Console
  2. Stop the prod-instance instance
  3. Detach the root EBS volume from the prod-instance
    1. Select the prod-instance EC2 instance in the AWS console and view the content in the “Description” tab in the panel below the instance list
    2. Search for the “Root device” field
    3. Click on the link next to it
      • It should look something like this: /dev/xvda
      • A dialog box will pop up
        Block Device Modal
    4. Take note of the EBS ID
      • For the steps below, assume the EBS ID is vol-0c7bf2325c6ab485b
    5. Click on the EBS ID link
      • This will take you to a new list with information on that EBS Volume
        Available Volumes
    6. Make sure the EBS Volume vol-0c7bf2325c6ab485b is selected and click Actions -> Detach Volume
      Attached Volume Actions
    7. If you would like to abort this and reattach the volume, jump to step #15
  4. Create a brand new micro instance that you're able to SSH into and let it start up. We'll call it maintenance-instance.
    • Make sure that it's in the same Region and Availability Zone as the machine you detached the root volume from. Volumes cannot move between Availability Zones.
    • Note: Be sure you can SSH onto the machine before proceeding
      ssh -i {pem_file} {username}@{ec2_host_or_ip}
       Prod Instance Stopped
  5. Attach the prod-instance's old root EBS volume to the maintenance-instance as an additional drive
    1. Click on the “Volumes” link on the left side of the AWS EC2 Web Console under ELASTIC BLOCK STORE
    2. Search for the EBS Volume you detached (vol-0c7bf2325c6ab485b). It will also be listed as having the State “available” (as opposed to “in-use”).
      Volume available
    3. Select the volume and click Actions -> Attach Volume
      Detached Volume Actions
    4. This will open a modal
      Attach Volume
    5. Search for the maintenance-instance and click on the entry
      Instance Added to Attach Volume
      • Clicking on the entry will put a default value into the Device field. If it doesn't, you can enter /dev/sdf.
    6. Click Attach
    7. Note: You do not need to stop or restart the maintenance-instance before or after attaching the volume.
  6. SSH onto the maintenance-instance
  7. Login as root
    sudo su
  8. Check the disk to ensure that the prod-instance's old root EBS volume is available and get the device name
    1. Run the following command to get information about what volumes are currently mounted (which should only be the default root volume at this point)
      df -h
      • This will produce a result like this:
        Filesystem Size Used Avail Use% Mounted on
        devtmpfs 488M 64K 488M 1% /dev
        tmpfs 498M 0 498M 0% /dev/shm
        /dev/xvda1 7.8G 981M 6.7G 13% /
      • What this tells you is that there is one main drive called /dev/xvda1 which is the root volume of the maintenance-instance. Thus we can ignore this device name.
    2. Run the following command to find the device name of the volume we want to repair
      lsblk
      • This will produce a result like this:
        NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
        xvda 202:0 0 8G 0 disk
        └─xvda1 202:1 0 8G 0 part /
        xvdf 202:80 0 8G 0 disk
        └─xvdf1 202:81 0 8G 0 part
      • What this tells you is that there are two disks attached, each with one partition. We already know that the xvda device is the original root volume of the maintenance-instance, so by process of elimination xvdf is the disk we attached to the machine and want to repair.
    3. Get the device name of the volume you mounted onto the machine
      1. In our case, based on the output above, the device name is /dev/xvdf1 (which is the partition of that disk)
      2. Note: you may have noticed that a device name is also shown in the AWS console under the instance's Description, in the Block devices section. However, the value shown in the AWS console isn't always the same as the one reported by fdisk or lsblk, so don't use it; use the name reported by fdisk or lsblk.
  9. Create the directory that you want to mount the volume to (this can be named and placed wherever you would like)
    mkdir /badvolume
  10. Mount the drive's partition to the directory
    mount /dev/xvdf1 /badvolume
  11. Explore the file system and make the necessary changes
    • Change directory to the newly mounted file system
      cd /badvolume
    • Note: In our case, since we were dealing with a mounting issue, we had to modify the /etc/fstab file to stop the machine from trying to mount the volume that was failing. Since the prod-instance's root volume was mounted onto the /badvolume directory, the fstab file we needed to fix was at /badvolume/etc/fstab.
      • We simply commented out the bad entry and moved on (an illustrative fstab snippet follows this step list)
    • When you have completed your repairs, move onto the next step
  12. Unmount the drive from the machine
    umount /badvolume
  13. Switch back to the AWS Web Console
  14. Detach the vol-0c7bf2325c6ab485b volume from the maintenance-instance
    1. Click on the “Volumes” link on the left side of the AWS Web Console under ELASTIC BLOCK STORE
    2. Search for the EBS Volume you detached (vol-0c7bf2325c6ab485b). It will also be listed as having the State “in-use”.
    3. Select the volume and click Actions -> Detach Volume
      Attached Volume Actions
  15. Re-Attach the vol-0c7bf2325c6ab485b volume to the prod-instance as the root volume
    1. Click on the “Volumes” link on the left side of the AWS Web Console under ELASTIC BLOCK STORE
    2. Search for the EBS Volume you detached (vol-0c7bf2325c6ab485b). It will also be listed as having the State “available”.
    3. Select the volume and click Actions -> Attach Volume
      Detached Volume Actions
    4. This will open a modal
      Attach Volume
    5. Search for the prod-instance
    6. Set the Device field to the root device value: /dev/xvda
      Instance Added to Attach Volume
    7. Click Attach
  16. Start the prod-instance
  17. Test SSH’ing onto the prod-instance
  18. If you’re still having issues connecting to the prod-instance then check the system logs of the machine to debug the problem and, if necessary, repeat these steps to fix the issue with the drive.
  19. When you’re all done you can terminate the maintenance-instance

Creating a Custom Origin for StreamSets

StreamSets Data Collector:

StreamSets Data Collector is a lightweight and powerful engine that streams data in real time. It allows you to build continuous data pipelines, each of which consumes record-oriented data from a single origin, optionally operates on those records in one or more processors and writes data to one or more destinations.

StreamSets Origin Stage:

To define the flow of data for Data Collector, you configure a pipeline. A pipeline consists of stages that represent the origin and destination of the pipeline and any additional processing that you want to perform.

An origin stage represents the source for the pipeline.

For example, this pipeline, based on the SDC taxi data tutorial (https://streamsets.com/documentation/datacollector/latest/help/#Tutorial/Overview.html), uses the Directory origin, four processors and the Hadoop File System destination:

 

pipeline

 

StreamSets comes bundled with many origin stage components that connect to almost all commonly used data sources. If you don't find one for your source system, don't worry: the StreamSets APIs are there to help you create a custom origin stage for it.

This blog explains how to get started writing your own custom StreamSets origin stage to stream records from Amazon SQS (Simple Queue Service).

Requirements:

  • Java installed
  • IDE (Eclipse/IntelliJ) set up
  • StreamSets Data Collector

Creating and building the origin template

Follow the StreamSets Data Collector documentation to download, install and run StreamSets Data Collector.

You will also need to download the source for the Data Collector and its API. Just make sure that you have matching versions for the runtime and source; you might find it easier to download tarballs from the relevant GitHub release pages rather than using git clone.

Build both the Data Collector and its API:

$ cd datacollector-api
$ mvn clean install -DskipTests
...output omitted...
$ cd ../datacollector
$ mvn clean install -DskipTests
...output omitted...

Maven puts the library JARs in its repository, so they're available when we build our custom origin.

Create Skeleton Project:

Now create a new custom stage project using the Maven archetype:

$ mvn archetype:generate -DarchetypeGroupId=com.streamsets -DarchetypeArtifactId=streamsets-datacollector-stage-lib-tutorial -DarchetypeVersion={version} -DinteractiveMode=true

The above command uses the streamsets-datacollector-stage-lib-tutorial Maven archetype to create the skeleton project; this is the easiest way to get started developing your own stages.

Provide values for the groupId, artifactId, version and package properties when prompted.
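
For illustration, the prompts and hypothetical answers might look like this (the groupId and package are placeholders; the artifactId matches the example_stage project used later in this post):

Define value for property 'groupId': com.example
Define value for property 'artifactId': example_stage
Define value for property 'version': 1.0-SNAPSHOT
Define value for property 'package': com.example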

Maven generates a template project from the archetype in a directory with the artifactId you provided as its name. As you can see, there is template code for an origin, a processor and a destination:

 

structure

 

Origin template classes: 

In the figure above, the important classes for the origin stage are:

  • Groups.java: Holds the labels for the configuration tabs in the Data Collector UI
  • SampleDSource.java: Contains the stage and configuration definitions and assigns those configurations to their respective groups
  • SampleSource.java: This is where the actual logic to read data from the source is written

Basic custom origin stage

Now you can build the template:

$ cd example_stage
$ mvn clean package -DskipTests

Extract the tarball to SDC's user-libs directory, restart SDC, and you should see the sample stages in the stage library:

$ cd ~/streamsets-datacollector-{version}/user-libs/
$ tar xvfz {new project root dir}/target/example_stage-1.0-SNAPSHOT.tar.gz
x example_stage/lib/example_stage-1.0-SNAPSHOT.jar

Restart the data collector and you will be able to see the sample origin in the stage library panel.

 

stage_panel 

Understanding the Origin Template Code
Let’s walk through the template code, starting with Groups.java.

Groups.java

The Groups enumeration holds the labels for the configuration tabs. Replace the default label with one for AWS SQS:

@GenerateResourceBundle
public enum Groups implements Label {
  SQS("AWS SQS"),
  ;
  private final String label;
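
The rest of the enum is unchanged template boilerplate: a constructor that stores the label and the getLabel() method required by the Label interface. For completeness, it looks roughly like this:

  Groups(String label) {
    this.label = label;
  }

  /** {@inheritDoc} */
  @Override
  public String getLabel() {
    return this.label;
  }
}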

SampleDSource.java

Stage and configuration definitions

Inside SampleDSource.java, define the stage and its configurations and assign those configurations to their respective groups. In our case we require the AWS credentials, SQS endpoint and queue name in order to retrieve messages from SQS.

@StageDef(
    version = 1,
    label = "SQS Origin",
    description = "",
    icon = "default.png",
    execution = ExecutionMode.STANDALONE,
    recordsByRef = true,
    onlineHelpRefUrl = ""
)
@ConfigGroups(value = Groups.class)
@GenerateResourceBundle
public class SampleDSource extends SampleSource {

  @ConfigDef(
          required = true,
          type = ConfigDef.Type.STRING,
          defaultValue = "",
          label = "Access Key",
          displayPosition = 10,
          group = "SQS"
  )
  public String access_key;

  @ConfigDef(
          required = true,
          type = ConfigDef.Type.STRING,
          defaultValue = "",
          label = "Secrete Key",
          displayPosition = 10,
          group = "SQS"
  )
  public String secrete_key;

  @ConfigDef(
      required = true,
      type = ConfigDef.Type.STRING,
      defaultValue = "",
      label = "Name",
      displayPosition = 10,
      group = "SQS"
  )
  public String queue_name;

  @ConfigDef(
          required = true,
          type = ConfigDef.Type.STRING,
          defaultValue = "",
          label = "End Point",
          displayPosition = 10,
          group = "SQS"
  )
  public String end_point;

  /** Delete message once read from Queue */
  @ConfigDef(
          required = true,
          type = ConfigDef.Type.BOOLEAN,
          defaultValue = "",
          label = "Delete Message",
          displayPosition = 10,
          group = "SQS"
  )
  public Boolean delete_flag;


  /** {@inheritDoc} */
  @Override
  public String getEndPoint() {
    return end_point;
  }

  /** {@inheritDoc} */
  @Override
  public String getQueueName() {
    return queue_name;
  }


  /** {@inheritDoc} */
  @Override
  public String getAccessKey() {
    return access_key;
  }

  /** {@inheritDoc} */
  @Override
  public String getSecreteKey() {
    return secrete_key;
  }

  /** {@inheritDoc} */
  @Override
  public Boolean getDeleteFlag() {
    return delete_flag;
  }
}

SampleSource.java

Read the configurations and implement the actual logic to read messages from the origin

The source extends the BaseSource class from the StreamSets API:

public abstract class SampleSource extends BaseSource {

Abstract methods allow the source to get configuration data from its subclass:

The SampleSource class uses the SampleDSource subclass to get access to the UI configuration. Replace the template's getConfig() method with the following methods:

/**
 * Gives access to the UI configuration of the stage provided by the {@link SampleDSource} class.
 */
public abstract String getEndPoint();
public abstract String getQueueName();
public abstract String getAccessKey();
public abstract String getSecreteKey();
public abstract Boolean getDeleteFlag();

Validate Pipeline Configuration

SDC calls the init() method when validating and running a pipeline. The sample shows how to report configuration errors

@Override
protected List<ConfigIssue> init() {
    // Validate configuration values and open any required resources.
    List<ConfigIssue> issues = super.init();

    if (getEndPoint().isEmpty() || getQueueName().isEmpty() || getAccessKey().isEmpty() || getSecreteKey().isEmpty()) {
        issues.add(
                getContext().createConfigIssue(
                        Groups.SQS.name(), "config", Errors.SAMPLE_00, "Provide the required parameters."
                )
        );
    }

    // If issues is not empty, the UI will inform the user of each configuration issue in the list.
    return issues;
}

SDC calls destroy() during validation, and when a pipeline is stopped

/**
 * {@inheritDoc}
 */
@Override
public void destroy() {
    // Clean up any open resources.
    super.destroy();
}

Put custom logic to read data from the source system

The produce() method is where we write the actual logic that reads data from the source system. Replace the template code with the following logic to read messages from SQS:

public String produce(String lastSourceOffset, int maxBatchSize, BatchMaker batchMaker) throws StageException {
    // Offsets can vary depending on the data source. Here we use an integer as an example only.
    long nextSourceOffset = 0;
    if (lastSourceOffset != null) {
        nextSourceOffset = Long.parseLong(lastSourceOffset);
    }

    int numRecords = 0;

    // Create records and add to batch. Records must have a string id. This can include the source offset
    // or other metadata to help uniquely identify the record itself.

    AWSSQSUtil awssqsUtil = new AWSSQSUtil(getAccessKey(),getSecreteKey(),getQueueName(),getEndPoint());

    String queueName = awssqsUtil.getQueueName();
    String queueUrl = awssqsUtil.getQueueUrl(queueName);

    // maximum number of messages that can be retrieved in one request
    int maxMessageCount = 10;

    List<Message> messages = awssqsUtil.getMessagesFromQueue(queueUrl, maxMessageCount);
    for (Message message : messages) {
        Record record = getContext().createRecord("messageId::" + message.getMessageId());
        Map<String, Field> map = new HashMap<>();
        map.put("receipt_handle", Field.create(message.getReceiptHandle()));
        map.put("md5_of_body", Field.create(message.getMD5OfBody()));
        map.put("body", Field.create(message.getBody()));

        JSONObject attributeJson = new JSONObject();

        for (Map.Entry<String, String> entry : message.getAttributes().entrySet()) {
            attributeJson.put(entry.getKey(), entry.getValue());
        }

        map.put("attribute_list", Field.create(attributeJson.toString()));

        record.set(Field.create(map));
        batchMaker.addRecord(record);
        ++nextSourceOffset;
        ++numRecords;
        if (getDeleteFlag()) {
            awssqsUtil.deleteMessageFromQueue(queueUrl, message);
        }
    }
    return String.valueOf(nextSourceOffset);
}
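
The produce() method above relies on a small helper class, AWSSQSUtil, whose implementation is not part of the template and is not shown in this post. A minimal sketch using the AWS SDK for Java (v1), matching the calls made above, might look like the following; treat it as an illustration rather than the exact class we used:

import java.util.List;

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.sqs.AmazonSQSClient;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

public class AWSSQSUtil {

  private final AmazonSQSClient client;
  private final String queueName;

  public AWSSQSUtil(String accessKey, String secretKey, String queueName, String endPoint) {
    // Build an SQS client from the stage configuration values.
    this.client = new AmazonSQSClient(new BasicAWSCredentials(accessKey, secretKey));
    this.client.setEndpoint(endPoint);
    this.queueName = queueName;
  }

  public String getQueueName() {
    return queueName;
  }

  /** Resolve the queue URL from the queue name. */
  public String getQueueUrl(String queueName) {
    return client.getQueueUrl(queueName).getQueueUrl();
  }

  /** Fetch up to maxMessageCount messages (SQS allows at most 10 per request). */
  public List<Message> getMessagesFromQueue(String queueUrl, int maxMessageCount) {
    ReceiveMessageRequest request = new ReceiveMessageRequest(queueUrl)
        .withMaxNumberOfMessages(maxMessageCount)
        .withAttributeNames("All");
    return client.receiveMessage(request).getMessages();
  }

  /** Delete a message once it has been turned into a record. */
  public void deleteMessageFromQueue(String queueUrl, Message message) {
    client.deleteMessage(queueUrl, message.getReceiptHandle());
  }
}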

Errors.java

Create custom error messages

To create stage-specific error messages, implement the ErrorCode interface:

@GenerateResourceBundle
public enum Errors implements ErrorCode {

  SAMPLE_00("A configuration is invalid because: {}"),
  SAMPLE_01("Specific reason writing record failed: {}"),
  ;
  private final String msg;

  Errors(String msg) {
    this.msg = msg;
  }

  /** {@inheritDoc} */
  @Override
  public String getCode() {
    return name();
  }

  /** {@inheritDoc} */
  @Override
  public String getMessage() {
    return msg;
  }
}

Create the pipeline with custom origin

Follow the build, extract and restart steps as before, then create a pipeline using the SQS origin and provide the configuration values. The pipeline will read click logs from SQS, extract the clicks made from a particular browser, and write them to the local file system.
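
As an illustration of the browser filter, if an upstream processor has parsed the SQS message body into a record with a /browser field (that field path is hypothetical), a Stream Selector condition along these lines could route just the Chrome clicks to the file destination:

${record:value('/browser') == 'Chrome'}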

screen-shot-2016-11-24-at-3-05-06-pm

screen-shot-2016-11-24-at-3-16-16-pm

 

Run the pipeline and you will see the messages streaming from the SQS queue.

screen-shot-2016-11-24-at-3-26-20-pm

 

Congratulations! You have successfully created your first custom origin stage.