Using NGINX Plus for Backend Upgrades with Zero Downtime, Part 1 – Overview
Upgrading backend servers in a production environment can be a challenge for your operations or DevOps team, whether they are dealing with an individual server or upgrading an application by moving to a new set of servers. Putting upstream servers behind NGINX Plus can make the upgrade process much more manageable while also eliminating or greatly lessening downtime.
In a three-part series of articles, we’ll focus on NGINX Plus; with a number of features above and beyond those in the open source NGINX software, it’s a more comprehensive and controllable solution for upgrades with zero downtime. This first article describes in detail the two NGINX Plus features you can use for backend upgrades – the on-the-fly reconfiguration API and health checks – and compares them to upgrading with the open source NGINX software.
The related articles explain how to use the methods for two classes of upgrades:
- Upgrading hardware or software on an individual server machine
- Upgrading to a new version of an application by switching traffic to completely different servers or upstream groups
Choosing an Upgrade Method in NGINX Plus
NGINX Plus provides two methods for dynamically upgrading production servers and application versions:
- On-the-fly reconfiguration API – Use an HTTP-based API to send HTTP requests to NGINX Plus that add, remove, or modify the servers in an upstream group.
- Application-aware health checks – Define health checks so that you can purposely fail servers you want to take out of the load balancing rotation, and make them pass the health check when they are again ready to receive traffic.
The two methods differ with respect to several factors, so the choice between them depends on your priorities:
- Speed of change – With the API, the change takes effect immediately. With health checks, the change doesn’t take effect until a health check fails (the default frequency of health checks is 5 seconds).
- Initial traffic volume – With health checks, you can configure slow start: when a server returns to service, NGINX Plus slowly ramps up the load to the server over a defined period, allowing applications to “warm up” (populate caches, run just-in-time compilations, establish database connections, and so on). The server is not overwhelmed by connections, which might time out and cause it to be marked as failed again. With the API, NGINX Plus immediately sends a server its full share of traffic.
- Automation and scripting – With the API, you can automate and script most phases of the upgrade, and do so within the NGINX Plus configuration. To automate upgrades when using health checks, you must also create scripts that run on the servers being upgraded (for example, to manipulate the file used for semaphore health checks).
In general, we recommend the NGINX Plus on-the-fly reconfiguration API for most use cases because changes take effect immediately and the API is fully scriptable and automatable.
Upgrading with Open Source NGINX
First, let’s review how upgrades work with the open source NGINX software, and explore some possible issues. Here you change upstream server groups by editing the upstream configuration block and reloading the configuration file. The configuration reload is seamless: a new set of worker processes is started with the new configuration, while the existing worker processes continue to run and handle the connections that were open when the reload occurred. Each old worker process terminates as soon as all of its connections have completed. This design guarantees that no connections or requests are lost during the reload, and makes the reload method suitable even when upgrading NGINX itself from one version to another.
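For instance, here is a minimal sketch of the reload step, assuming the nginx binary is on your path and you have permission to signal the master process:

    # Verify the edited configuration, then signal a graceful reload;
    # new worker processes start with the new configuration while old
    # workers finish handling their existing connections
    nginx -t && nginx -s reload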
Depending on the nature of the outstanding connections, the time it takes to complete them all can range from just seconds to several minutes. If the configuration doesn’t change often, running two sets of workers for a short time usually has no bad effects. However, if changes (and consequently reloads) are very frequent, old workers might not finish processing requests and terminate before the next reload takes place, leaving multiple sets of workers running at once. With enough workers, you might eventually end up exhausting memory and hitting 100% CPU, particularly if you’re already optimizing use of resources by running your servers at close to capacity.
When you’re load balancing application servers, upstream groups are the part of the configuration that changes most frequently, whether it’s to scale capacity up and down, upgrade to a new version, or take servers offline for maintenance. Customers running hundreds of virtual servers load balancing traffic across thousands of backend servers might need to modify upstream groups very frequently. Using the reconfiguration API or health checks in NGINX Plus, you avoid the problem of frequent configuration reloads.
Overview of the NGINX Plus Upgrade Methods
The use cases discussed in the two related articles use one of the following methods, sometimes in combination with auxiliary actions.
Upgrading with the On-the-Fly Reconfiguration API
To use the on-the-fly reconfiguration API to manage the servers in an upstream group, you issue HTTP commands which all start with the following URL string. We’re using the conventional location name for the API, /upstream_conf, but you can configure a different name (see the section about the base configuration in the second or third article).
http://NGINX-server[:port]/upstream_conf?upstream=upstream-group-name
When you issue this command with no additional parameters, a list of the servers and their ID numbers is returned, as in this example for the use cases we’ll cover in the other two articles:
http://localhost:8080/upstream_conf?upstream=demoapp
server 172.16.210.81:80; # id=0
server 172.16.211.82:80; # id=1
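Assuming the API is exposed at localhost:8080 as in this example, you can issue the query from the command line with a tool such as curl:

    # List the servers in the 'demoapp' upstream group along with their IDs
    curl 'http://localhost:8080/upstream_conf?upstream=demoapp'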
To make changes to the servers in the upstream group, append other strings to the base URL as indicated:
- Add a server – Append this string:
  ...&add=&server=IP-address[:port]
  By default, the server is marked up and NGINX Plus starts sending traffic to it immediately. To mark it down so that it does not receive traffic until you are ready to mark it as up, append the down= parameter:
  ...&add=&server=IP-address[:port]&down=
- Remove a server – NGINX Plus terminates all connections immediately and sends no more requests to the server. Append this string:
  ...&id=ID&remove=
- Mark a server as down – NGINX Plus stops opening new connections to the server, but any existing connections are allowed to complete. Using the NGINX Plus live activity monitoring dashboard or API, you can see when the server no longer has any open connections and can safely be taken offline. Append this string:
  ...&id=ID&down=
- Mark a server as drain(ing) – NGINX Plus stops sending traffic from new clients to the server, but allows clients that have a persistent session with the server to continue opening connections and sending requests to it. Once you feel that you have allowed enough time for sessions to complete, you can mark the server as down and take it offline. For a discussion of ways to automate the check for completed sessions, see Using the API with Session Persistence for an Individual Server Upgrade. Append this string:
  ...&id=ID&drain=
- Mark a server as up – NGINX Plus immediately starts sending traffic to it. Append this string:
  ...&id=ID&up=
- Change server configuration – You can set any of the parameters on the server directive. We’ll use this feature to set server weights in several of the use cases.
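As an illustration, here is a minimal sketch of a single-server upgrade sequence using the strings above with curl; the demoapp group, the localhost:8080 API endpoint, and the new server’s IP address are just the sample values used in this article, not required names:

    # Stop opening new connections to the server with ID 0;
    # existing connections are allowed to complete
    curl 'http://localhost:8080/upstream_conf?upstream=demoapp&id=0&down='

    # ... upgrade the server, then return it to the load-balancing rotation ...
    curl 'http://localhost:8080/upstream_conf?upstream=demoapp&id=0&up='

    # Alternatively, add a brand-new server, marked down until you are ready
    curl 'http://localhost:8080/upstream_conf?upstream=demoapp&add=&server=172.16.210.83:80&down='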
Upgrading with Application Health Checks
Configuring application health checks is an easy way to improve the user experience at your site. By having NGINX Plus continually check whether backend servers are up and remove unavailable servers from the load-balancing rotation, you reduce the number of errors seen by clients. You can also use health checks to bring servers up and down, instead of (or in addition to) the API.
There are a few significant differences between health checks and the API:
- You take servers up and down by taking action on the backend servers rather than by interacting with NGINX Plus. Most commonly, you define the health check to succeed if a particular file (healthcheck.html, for example) exists on the server, and to fail if it doesn’t. To take the server down, you make the health check fail by removing or renaming the file; to bring it back up, you make the health check succeed by restoring the file or changing the name back to healthcheck.html. (See the sketch after this list.)
- With health checks, changes are not immediate as with the API, but instead depend on the health-check frequency. By default, health checks run every five seconds and only one failure is required for a server to be considered unhealthy, so with the default settings it can take up to five seconds for NGINX Plus to change the state of the server.
- An advantage of health checks over the API is that you can specify a timeframe after a server returns to health during which NGINX Plus gradually ramps up the load on the server (the slow-start feature). This is helpful if your servers need to “warm up” before they are ready to receive their fair share of the load.
- You can’t use health checks when using session persistence. When NGINX Plus marks a server as down because it fails a health check, the server no longer receives new connections, even from clients that are pegged to it by a session persistence mechanism. (In other words, with health checks you can set server state to the equivalent of the API’s up and down, but not to drain.)
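As a minimal sketch of the semaphore-file approach from the first item above (the web-root path and file name are just example values), taking a server out of rotation and bringing it back might look like this on the backend server itself:

    # Make the health check fail by renaming the semaphore file
    # (assumes the check requests /healthcheck.html served from /var/www/html)
    mv /var/www/html/healthcheck.html /var/www/html/healthcheck.html.disabled

    # ... perform the upgrade or maintenance ...

    # Restore the file so the next health check succeeds and NGINX Plus
    # returns the server to the load-balancing rotation
    mv /var/www/html/healthcheck.html.disabled /var/www/html/healthcheck.html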
Conclusion
NGINX Plus provides operations and DevOps engineers with several options for managing the upgrade process for both individual servers and groups of servers, all while continuing to provide a good customer experience by avoiding downtime. For comprehensive instructions on using the upgrade methods for specific use cases, see the other two articles in this series:
- Upgrading hardware or software on an individual server machine
- Upgrading to a new version of an application by switching traffic to completely different servers or upstream groups
Try NGINX Plus out for yourself and see how it makes upgrades easier and more efficient – start a 30-day free trial today or contact us for a live demo.