More Fun with NGINX Plus Health Checks and Docker Containers

At nginx.conf 2017, I gave a presentation on this topic. It is available as a YouTube video and as a blog post that includes the PowerPoint slides and a transcript of my talk. In this blog post, I’ll describe an improved version of the basic approach, then give specific, working configuration code you can use to implement it yourself.

Introduction

When running containers in a microservices environment, your service instances may be susceptible to becoming overloaded due to resource limitations, such as memory or CPU utilization. A number of strategies can be employed to address this issue; this blog post addresses using NGINX Plus active health checks as one strategy.

We’ll focus on three different use cases:

Request-count-based. Use this method when requests to a service are so heavyweight that a service instance can only handle one request at a time.
CPU-usage-based. Use this method when CPU utilization is the main limiting factor for a service and you want to set a CPU usage threshold, after which the service shouldn’t accept any additional requests.
Memory-usage-based. Use this method when memory utilization is the main limiting factor for a service and you want to set a memory usage threshold, after which the service shouldn’t accept any additional requests.

All three of these methods work in the same fundamental way. A program is written to act as the active health check, which NGINX Plus calls. Based on one of the methods above, this program returns a status of either healthy or unhealthy. NGINX Plus removes a service instance from the load-balancing rotation when it shows as unhealthy, and adds it back to the rotation when it shows as healthy.

Health Check Approaches

Let’s get into the details of each method. Code for the examples is available here.

For all of the examples, we’re using NGINX Plus as the load balancer and NGINX Unit as the application server, with two examples written in PHP and one written in Python. These are all running in Docker containers.

Request-Count-Based

For this method, a semaphore file, /tmp/busy, is created by the application as soon as a request is received, then removed when the request processing is completed. When you run the health check, it checks to see if the file exists. If the file is found, the health check returns a status of unhealthy, causing NGINX Plus to stop sending requests to the service instance. Once the request has been completed, the file is removed and the health check will show as healthy again.

For the example, a single Python program, testcnt.py, serves as both the application and the health check; the URI determines which function is executed.

The shortest interval between health checks is one second, so it may take up to one second for NGINX Plus to see that the service instance is busy. During that one second, NGINX Plus may send another request to the service instance. To handle this case, the application will return an HTTP status code of 503 when it receives a request while already busy processing another request. If this happens, NGINX Plus tries another upstream server.
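The request-count logic described above can be sketched as follows. This is a minimal illustration, not the actual testcnt.py: the function names, the response bodies other than the `{"HealthCheck":"OK"` prefix, the status code returned by a failing health check, and the use of an atomic exclusive create for the semaphore file are all assumptions made for the sketch.

```python
import os

BUSY_FILE = "/tmp/busy"  # semaphore path named in the post


def try_mark_busy(path=BUSY_FILE):
    """Atomically create the semaphore file; return False if it already exists."""
    try:
        # O_EXCL makes creation fail if another request already holds the file.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def health_check(path=BUSY_FILE):
    """Return (status, body); healthy only when no semaphore file exists."""
    if os.path.exists(path):
        # Assumed failure response: anything not matching server_ok fails.
        return 503, '{"HealthCheck":"Busy"}'
    return 200, '{"HealthCheck":"OK"}'


def handle_request(work, path=BUSY_FILE):
    """Run work() while holding the semaphore; answer 503 if already busy."""
    if not try_mark_busy(path):
        return 503, '{"Error":"busy"}'
    try:
        return 200, work()
    finally:
        # Always release the semaphore, even if work() raises.
        os.remove(path)


def application(environ, start_response):
    """WSGI entry point; the query string selects the function to execute."""
    if environ.get("QUERY_STRING") == "healthcheck":
        status, body = health_check()
    else:
        status, body = handle_request(lambda: '{"Result":"done"}')
    reason = {200: "OK", 503: "Service Unavailable"}[status]
    start_response(f"{status} {reason}", [("Content-Type", "application/json")])
    return [body.encode()]
```

Creating the file with an exclusive flag means two requests arriving at almost the same moment cannot both believe they own the instance; the second one gets the 503 that lets NGINX Plus try another upstream server.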

CPU-Based

The Docker API can be used to get CPU usage metrics for a container, but the metrics returned are relative to the Docker host. In other words, if the Docker API reports that the CPU usage for a container is 25%, that is 25% of the CPU of the Docker host.

For this example, we set a threshold of 70% for all the containers for this application, and divide that by the number of containers to get the threshold per container. For example, if there is one container it can use 70% of the Docker host’s CPU. If there are two containers, each can use up to 35% of the Docker host’s CPU.

The NGINX Plus Status API is used to get the number of containers for the application.
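The threshold arithmetic above can be sketched as follows. The function names are hypothetical, and the sketch assumes the NGINX Plus API response for a single upstream (for example, a request for the unitcpu upstream) has already been parsed from JSON and contains a `peers` array with one entry per server.

```python
def count_peers(upstream_status):
    """Count the servers in an upstream from its parsed API response."""
    # Assumption: the response object holds a "peers" list, one item per server.
    return len(upstream_status.get("peers", []))


def per_container_threshold(total_threshold_pct, container_count):
    """Split the application-wide CPU threshold evenly across containers."""
    if container_count < 1:
        raise ValueError("container_count must be at least 1")
    return total_threshold_pct / container_count


# Example: a 70% application-wide threshold shared by two containers.
upstream = {"peers": [{"server": "10.0.0.1:8300"}, {"server": "10.0.0.2:8300"}]}
limit = per_container_threshold(70, count_peers(upstream))
```

With the two peers in this made-up response, each container is allowed up to 35% of the Docker host's CPU, matching the example in the text.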

There are two PHP programs: testcpu.php, which is the application that generates CPU load, and hcheck.php, which does the health check.

To get statistics for a container, the health check page makes the following call to the Docker API on the Docker host:

http://<Docker-Host-IP-Address>:<Docker-API-Port>/containers/<Container-ID>/stats?stream=0

To calculate the CPU usage, two calls must be made to the API, one second apart in this case. The cpu_stats.cpu_usage.total_usage field from these two calls is used to calculate the CPU usage.
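One way to turn the two samples into a percentage is sketched below. It is not the code from hcheck.php: it assumes total_usage is a cumulative count of CPU time in nanoseconds, and the `host_cpus` parameter and function names are introduced here for illustration. The host, port, and container ID arguments stand in for the placeholders in the URL above.

```python
import json
import urllib.request

NANOSECONDS_PER_SECOND = 1_000_000_000


def fetch_stats(host, port, container_id):
    """Fetch a single (non-streaming) stats sample from the Docker API."""
    url = f"http://{host}:{port}/containers/{container_id}/stats?stream=0"
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)


def cpu_percent(first, second, interval_s, host_cpus=1):
    """CPU used between two samples, as a percentage of the host's capacity.

    Assumes total_usage is cumulative CPU time in nanoseconds, so the
    difference between the samples is the CPU time consumed in the interval.
    """
    used_ns = (second["cpu_stats"]["cpu_usage"]["total_usage"]
               - first["cpu_stats"]["cpu_usage"]["total_usage"])
    available_ns = interval_s * NANOSECONDS_PER_SECOND * host_cpus
    return 100.0 * used_ns / available_ns
```

A health check built on this would call fetch_stats, sleep for one second, call it again, and compare the result of cpu_percent with the per-container threshold.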

Memory-Usage-Based

As in the CPU-based example, the Docker API is used to retrieve the memory usage metrics. Each container is limited to 128 megabytes of memory, and the memory usage metrics are relative to this limit.

There are two PHP programs: testmem.php, which causes memory usage, and hcheck.php, which does the health check. If the memory usage is above 70%, the health check returns a status of unhealthy.

The health check makes the same Docker API call as shown for the CPU usage method, but to get the memory usage it uses the fields memory_stats.usage and memory_stats.stats.hierarchical_memory_limit. It calculates the memory utilization percentage as memory_stats.usage/memory_stats.stats.hierarchical_memory_limit.
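The memory calculation can be sketched like this, using the two fields named above and the 70% threshold. The function names, the extra `MemoryPct` field, and the status code and body returned for an unhealthy result are assumptions for illustration; only the `{"HealthCheck":"OK"` prefix is taken from the configuration in this post.

```python
def memory_percent(stats):
    """Memory utilization as a percentage of the container's memory limit."""
    mem = stats["memory_stats"]
    return 100.0 * mem["usage"] / mem["stats"]["hierarchical_memory_limit"]


def memory_health(stats, threshold_pct=70.0):
    """Return (status, body); unhealthy when usage is above the threshold."""
    pct = memory_percent(stats)
    if pct > threshold_pct:
        # Assumed failure response: anything not matching server_ok fails.
        return 503, '{"HealthCheck":"Failed","MemoryPct":%.1f}' % pct
    return 200, '{"HealthCheck":"OK","MemoryPct":%.1f}' % pct


# Example: 96 MB used out of a 128 MB limit is 75%, above the threshold.
MB = 1024 * 1024
sample = {"memory_stats": {"usage": 96 * MB,
                           "stats": {"hierarchical_memory_limit": 128 * MB}}}
status, body = memory_health(sample)
```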

NGINX Configuration

There are no changes required in the main NGINX configuration file (/etc/nginx/nginx.conf). If you want to see detailed messages in the error log for health checks, you should set the log level to info. For example:

error_log  /var/log/nginx/error.log info;

The following is the specific NGINX Plus configuration for the example applications. Please consider the following in reading and, potentially, reusing this configuration:

Consul is used for DNS service discovery, and both Consul and NGINX Plus support DNS SRV records. This allows NGINX Plus to get not only the IP addresses of the containers, but also the ports. This is necessary because Docker port mapping is used.
The first server block, listening on port 80, allows requests to be sent to the health checks directly. This is required to see what an unhealthy health check looks like. If you were to try to send a request to a health check using a virtual server that has health checks in place, NGINX Plus will not allow requests to be sent to unhealthy servers.
To keep the configuration easy to understand, it has been kept minimal. Not all the directives of a best-practices configuration have been included.
The health check intervals are all short, so the system responds quickly while being demonstrated. The one-second interval for the count-based health check would likely also be used in production, since you want NGINX Plus to stop sending requests as soon as possible after the service becomes busy. The health check intervals for the other two health checks might be set to higher values in production.
This configuration and the CPU health check program utilize the dashboard.html page and Version 2 of the NGINX Plus API, both included in the NGINX Plus R14 release.
These examples are intended to show some ideas on how active health checks can be used, and have not been tested in production or at scale.

The application configuration (/etc/nginx/conf.d/backend.conf):

# Configure DNS. Point to Consul
resolver consul:53 valid=2s;
resolver_timeout 2s;

# The upstreams will be populated via DNS
upstream unitcnt {
    zone unitcnt 64k;
    server service.consul service=unitcnt resolve;
}

upstream unitcpu {
    zone unitcpu 64k;
    server service.consul service=unitcpu resolve;
}

upstream unitmem {
    zone unitmem 64k;
    server service.consul service=unitmem resolve;
}

# All successful health checks will have a string starting with {"HealthCheck":"OK"
match server_ok {
    status 200;
    body ~ '{"HealthCheck":"OK"';
}

server {
    # Allows calling upstream health checks directly
    listen 80;

    location /healthcheck {
        proxy_pass http://$arg_server/hcheck.php;
    }

    location /healthcheckpy {
        proxy_pass http://$arg_server/testcnt.py?healthcheck;
    }
}

server {
    listen 8001;
    status_zone unitcnt;
    root /usr/share/nginx/html;
    proxy_http_version 1.1;
    proxy_set_header Connection "";

    location ~ \.py$ {
        proxy_set_header Host $http_host;
        proxy_pass http://unitcnt;
        proxy_intercept_errors on;
        proxy_next_upstream http_503;

        # If all the servers are busy return apibusy.html
        error_page 502 503 =503 /apibusy.html;

        health_check uri=/testcnt.py?healthcheck match=server_ok interval=1s;
    }
}

server {
    listen 8002;
    status_zone unitcpu;
    root /usr/share/nginx/html;
    proxy_http_version 1.1;
    proxy_set_header Connection "";

    location ~ \.php$ {
        proxy_set_header Host $http_host;
        proxy_pass http://unitcpu;
        error_page 502 =503 /apibusy.html;
        health_check uri=/hcheck.php match=server_ok interval=5s;
    }
}

server {
    listen 8003;
    status_zone unitmem;
    root /usr/share/nginx/html;
    proxy_http_version 1.1;
    proxy_set_header Connection "";

    location ~ \.php$ {
        proxy_set_header Host $http_host;
        proxy_pass http://unitmem;
        error_page 502 =503 /apibusy.html;
        health_check uri=/hcheck.php match=server_ok interval=3s;
    }
}

# Configure the status API and dashboard
server {
    listen 8082;
    root /usr/share/nginx/html;

    location = /dashboard.html {
    }

    location = / {
        return 302 /dashboard.html;
    }

    location /api {
        access_log off;
        api;
    }
}
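To make the pass/fail rule of the server_ok match block concrete, here is a small Python model of it. This is only an illustration of the two conditions in the configuration (status 200 and a body that matches the pattern), not how NGINX Plus implements matching; the function name is hypothetical. Because the pattern in the match block is not anchored, the sketch uses a search rather than a prefix test.

```python
import re

# The literal text the match block looks for, escaped for use as a regex.
OK_PATTERN = re.compile(re.escape('{"HealthCheck":"OK"'))


def server_ok(status, body):
    """Model of the match block: status must be 200 and the body must match."""
    return status == 200 and OK_PATTERN.search(body) is not None


# A healthy response passes; a busy response or a non-200 status fails.
healthy = server_ok(200, '{"HealthCheck":"OK","MemoryPct":50.0}')
busy = server_ok(200, '{"HealthCheck":"Busy"}')
```

Any health check response that fails either condition causes the server to be treated as unhealthy, which is why the health check programs only need to agree on the format of the success response.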

Conclusion

NGINX Plus active health checks are an easy way of dealing with capacity limitations of services running in Docker, helping to make sure that service instances aren’t overloaded.

Get an NGINX Plus free trial and download the Unit beta and give it a try! All the code for the examples is available here.

The post More Fun with NGINX Plus Health Checks and Docker Containers appeared first on NGINX.
