Performance Tuning – Tips & Tricks

Over the past few years, I’ve worked with a handful of partners for whom NGINX Plus performance was a primary concern. The conversation typically starts with their difficulty hitting our published performance benchmarks. The difficulty usually comes from the partner jumping straight to a specific use case, such as using their existing SSL keys or targeting very large file payloads, and then seeing sub-par performance from NGINX Plus.

To a certain degree, this is expected behavior. I always like to explain to partners that, as a software component, NGINX Plus can run at near line-rate speeds on just about any hardware available to us when dealing with the most basic HTTP use case. To hit our published numbers with specific use cases, though, NGINX Plus often benefits from tweaks to its configuration and to low-level OS and hardware settings.

In every case to date, our partners have been able to reach the theoretical performance numbers for their very specific use cases simply by focusing on the OS components and hardware settings that need to be configured to match the use case, and on how NGINX Plus interacts with those components.

Below is a list of NGINX configuration, OS, and hardware tips, tricks, and tweaks I’ve compiled over the years to help partners and customers achieve the highest possible performance with the open source NGINX software, NGINX Plus, and their specific use cases.

Starting on Tuning

I generally recommend the following workflow when tackling performance-tuning issues:

  • Start with performance testing NGINX Plus in the most generic HTTP use case possible. This will allow you to set your own benchmarking baseline in your environment first (see the example run after this list).
  • Next, identify your specific use case. If, for instance, your application requires large file uploads, or if you’ll be dealing with high-security large SSL key sizes, define the end-goal use case first.
  • Configure NGINX Plus for your use case and re-test to determine the delta between theoretical performance in your environment and real-world performance with your use case.
  • Begin tweaking one setting at a time, focusing on the settings that most apply to your use case. In other words, don’t change a bunch of sysctl settings while also adding new NGINX directives at the same time. Start small, and start with the features that are most applicable to your use case. For example, if high security is critical for your environment, change SSL key types and sizes first.
  • If the change doesn’t impact performance, revert the setting back to the default. As you progress through each individual change, you’ll start to see a pattern where like settings tend to affect performance together. This will allow you to home in on the groups of settings that you can later tweak together as needed.
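
For example, a baseline run against a small static file over plain HTTP might look like the following. This is only a sketch: the 1k.bin file and localhost URL mirror the wrk example used in the CPU Affinity section later in this document, and the document root path is an assumption for your environment.

# dd if=/dev/zero of=/usr/share/nginx/html/1k.bin bs=1024 count=1
# wrk -t 4 -c 100 -d 30s http://localhost/1k.bin

Record the requests/sec and latency numbers from this run; they become the baseline you compare against after each subsequent change.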

It’s important to note that every deployment environment is unique and comes with its own networking and application performance requirements. It may not be advisable to change some of these values in production. The configuration tweaks outlined below can produce dramatically different results depending on application type and networking topology. This document should only be used as a guide to a subset of configuration settings that can impact performance; it’s not an exhaustive list, nor should every setting below necessarily be changed in your environment.

With NGINX having such strong roots in the open source community, many people over the years have contributed back to the performance conversations. Where applicable, I’ve included links to external resources for specific performance-tuning suggestions from people who have already battle-tested many of these solutions in production.

NGINX Config Tuning

Please refer to the NGINX documentation for details about configuring any of the values below, their default settings, and the scope within which each setting is supported.

SSL

This section describes how to remove slow and unnecessary ciphers from OpenSSL and NGINX.

When SSL performance is paramount, it’s always a good idea to try different key sizes and types in your environment – finding the balance between longer keys for increased security and shorter keys for faster performance, based on your specific security needs. An easy test is to move from more traditional RSA keys to Elliptic Curve Cryptography (ECC), which uses smaller key sizes (and is therefore computationally faster) for the same level of security.

To generate quick, self-signed ECC keys for testing:

  • #ECC P-256
  • openssl ecparam -out ./nginx-ecc-p256.key -name prime256v1 -genkey
  • openssl req -new -key ./nginx-ecc-p256.key -out ./nginx-ecc-p256-csr.pem -subj '/CN=localhost'
  • openssl req -x509 -nodes -days 30 -key ./nginx-ecc-p256.key -in ./nginx-ecc-p256-csr.pem -out ./nginx-ecc-p256.pem
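
To have NGINX serve HTTPS with the test certificate and key generated above, reference them with the standard ssl_certificate directives. This is a minimal sketch; the /etc/nginx/ssl/ path and document root are assumptions for your environment:

server {
    listen 443 ssl;
    server_name localhost;

    ssl_certificate     /etc/nginx/ssl/nginx-ecc-p256.pem;
    ssl_certificate_key /etc/nginx/ssl/nginx-ecc-p256.key;

    root /usr/share/nginx/html;
}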

Enable keepalive to reuse SSL handshakes:

  • keepalive_timeout 300s;
  • keepalive_requests 1000000;

Caching and Compression

When serving static content, it’s advisable to leave the open_file_cache directives set to defaults – that is, don’t set any open_file_cache directives for initial testing, then tweak from there, depending on your specific needs, to see how each setting affects your caching performance.

Some examples of directives which can be enabled after testing with defaults:

  • open_file_cache max=1000 inactive=20s;
  • open_file_cache_min_uses 2;
  • open_file_cache_valid 30s;
  • open_file_cache_errors off;

Setting gzip parameters can increase NGINX performance as these allow granular control over how NGINX delivers static, text-based content:

  • gzip on;
  • gzip_min_length 1000;
  • gzip_types text/html application/x-javascript text/css application/javascript;
  • gzip_disable "MSIE [1-6].";

More information on granular gzip control can be found in the NGINX documentation for the gzip module.

General

Please refer to the NGINX documentation for details on each one of these configuration options, proper syntax, scope of application (HTTP, server, location), etc. These are options that don’t fit in any specific category:

multi_accept off;

If multi_accept is disabled, a worker process will accept one new connection at a time. If enabled, a worker process will accept all new connections at once.

While enabling multi_accept can be beneficial, it’s advisable to start performance testing with it disabled so you get a more predictable baseline for scale.

accept_mutex on;

If accept_mutex is enabled, worker processes will accept new connections by turns. Otherwise, all worker processes will be notified about new connections; as a result, if the volume of new connections is low, some of the worker processes may just waste system resources.

Generally, accept_mutex on; gives even, predictable, best-case behavior under low-to-moderate loads. Consider disabling under very high load to maximize accept rate.

Note: There’s no need to enable accept_mutex on systems that support the EPOLLEXCLUSIVE flag (1.11.3) or when using reuseport (see note below re: reuseport).

proxy_buffering off;

When buffering is disabled, the response is passed to the client synchronously, immediately as it is received. NGINX will not try to read the whole response from the proxied server. The maximum size of the data that NGINX can receive from the server at a time is set by the proxy_buffer_size directive.

This is not just a simple toggle: disabling buffering switches NGINX to a separate, specific code path, so it is inadvisable to turn buffering off without a good reason.

access_log /path/to/access.log main buffer=16k; adds buffering to access log (16k or higher).

access_log off;

error_log off;

If using a tool for gathering NGINX metrics, it may be possible and advantageous to disable logging completely. For example, if you’re consuming JSON metrics from NGINX Plus or using a monitoring tool such as NGINX Amplify or an NGINX partner tool such as Datadog, you can safely disable local logging. (Note that, unlike access_log, the error_log directive has no special off value – error_log off; simply writes to a file named off – so point error_log at /dev/null if you truly want to discard it.)

keepalive 128;

Use the keepalive directive to enable keepalive connections from NGINX Plus to upstream servers, defining the maximum number of idle keepalive connections to upstream servers that are preserved in the cache of each worker process. When this number is exceeded, the least recently used connections are closed. Without keepalives you are adding more overhead and being inefficient with both connections and ephemeral ports.

When enabling keepalive connections to your upstream servers, you must also use the proxy_http_version directive to tell NGINX Plus to use HTTP version 1.1, and the proxy_set_header directive to remove any headers named Connection. Both directives can be placed in the http, server, or location configuration blocks.

proxy_http_version 1.1;

proxy_set_header Connection "";
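
Putting the upstream keepalive pieces together, a configuration sketch might look like the following (the upstream name, server addresses, and keepalive count are illustrative only):

upstream backend {
    server 10.0.0.10:8080;
    server 10.0.0.11:8080;
    keepalive 128;
}

server {
    listen 80;

    location / {
        proxy_pass http://backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}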

sendfile on;

sendfile_max_chunk 512;

tcp_nopush on;

tcp_nodelay on;

These four settings all affect how NGINX pushes data into and out of TCP sockets.
sendfile on; enables NGINX to use the kernel’s sendfile() system call, often speeding up TCP data transfers. A good rule of thumb: disable for small responses, enable for large responses.

Note: Because data copied with sendfile() bypasses user space, it’s not subject to the regular NGINX processing chain and filters that change content, such as gzip. When a configuration context includes both the sendfile directive and directives that activate a content‑changing filter, NGINX automatically disables sendfile for that context.

sendfile_max_chunk limits the amount of data transferred in a single sendfile() call.

tcp_nopush on; ensures that packets are full before sending them out.

tcp_nodelay on; forces data in a buffer to be sent immediately.

client_header_timeout 3m;

client_body_timeout 3m;

send_timeout 3m;

Optimal timeout values can offer drastic improvements with applications that keep long-lived connections open before closing them on the backend.

listen 80 reuseport;

The reuseport parameter enables the SO_REUSEPORT socket option in NGINX, enabling port sharding.
For more information, please refer to our blog post on socket sharding.

Thread Pooling

Thread pooling consists of a task queue and a number of threads that handle the queue. When a worker process needs to do a potentially long operation, instead of processing the operation by itself, it puts a task in the pool’s queue, from which it can be taken and processed by any free thread.

Enabling and disabling thread pooling in NGINX is relatively straightforward in .conf files:

aio threads;

The way thread pools are managed, however, can be affected by other buffer-related configuration settings. For complete information on tweaking other settings to support thread pooling, please refer to our blog post on thread pools.
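
For reference, NGINX also supports named thread pools defined in the main context, which aio threads can then point at explicitly. A minimal sketch (the pool name, sizes, and location are illustrative, not recommendations):

thread_pool big_io threads=32 max_queue=65536;

http {
    server {
        location /downloads/ {
            aio threads=big_io;
            sendfile on;
        }
    }
}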

CPU Affinity

CPU affinity is used to control which CPUs NGINX utilizes for individual worker processes (find background on CPU affinity in the NGINX documentation):

worker_processes 4;
worker_cpu_affinity 0001 0010 0100 1000;

Here’s an easy way to determine cpumask:

  • Each CPU bit mask must be as long as the number of vCPUs, e.g. 16 vCPUs/cores == 16 bits.
  • Numbering is sequential, starting from the right: 0001 is CPU 1, 0010 is CPU 2, etc.

Example: 16 vCPUs with load distributed across only vCPU #s 6, 9, and 16

worker_cpu_affinity 0000000000100000 0000000100000000 1000000000000000;
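
If you would rather not count bits by hand, a small shell helper (purely illustrative, not part of NGINX) can print a mask of a given width with a single CPU bit set:

# Usage: print_mask <total vCPUs> <CPU number, counted from the right starting at 1>
print_mask() {
    local i
    for ((i=$1; i>=1; i--)); do
        if ((i==$2)); then printf 1; else printf 0; fi
    done
    echo
}

print_mask 16 6    # prints 0000000000100000
print_mask 16 9    # prints 0000000100000000
print_mask 16 16   # prints 1000000000000000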

Steps to define and test (based on a hyper-threaded server with 16 available vCPUs):

Determine the number of available vCPUs:

# lshw -C cpu | grep -i thread
configuration: cores=8 enabledcores=8 threads=16

# cat /proc/cpuinfo | grep -i processor | wc -l
16

This count (16) is the number of CPUs, and it corresponds to the number – and length – of bits available for your CPU mask.

Add the following to nginx.conf to only span the first 4 of the available 16 vCPUs:
worker_processes 4;
worker_cpu_affinity 0001 0010 0100 1000;

Load NGINX with a quick wrk session:
# wrk -t 1 -c 50 -d 20s http://localhost/1k.bin

If necessary, you can create a simple 1k.bin test file (placed in your NGINX document root) with:
# dd if=/dev/zero of=1k.bin bs=1024 count=1

Run top in CPU view mode (by pressing 1 after top starts).

You can repeat the test with different numbers of worker processes and affinity bindings to see how performance scales. It’s also an effective way to limit NGINX to the appropriate subset of available cores.

Linux/Kernel Optimizations

Kernel/OS Settings via sysctl

These instructions will detail how to temporarily change these settings for testing in a typical Linux environment. Specific values can vary widely between kernel and NIC driver versions; these values are ones that we’ve seen have a successful impact on NGINX performance over the years, but these specific values may not have the same impact in your environment.

Please consult your distro documentation for more information on any of these settings and how to set them permanently.

Some options for sysctl settings follow. You may need to experiment with different settings to find the best values for your installation.

# sysctl -w net.ipv4.ip_local_port_range="1024 65000"

One way to reduce ephemeral port exhaustion is with the Linux kernel setting net.ipv4.ip_local_port_range. If you notice that you’re running out of ephemeral ports, consider widening the range from the default, which is most commonly 32768 through 61000, to 1024 through 65000. As we describe in our blog post on the topic, this is a practical way to double the number of ephemeral ports available for use. For more information about changing kernel settings, see our earlier blog post on tuning NGINX for performance.
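
The sysctl -w form above is temporary and is lost at reboot. Once you’ve settled on a value, the usual way to persist it (check your distro’s conventions first; the file name here is arbitrary) is a file under /etc/sysctl.d/ that you then reload:

# echo "net.ipv4.ip_local_port_range = 1024 65000" >> /etc/sysctl.d/90-nginx-tuning.conf
# sysctl -p /etc/sysctl.d/90-nginx-tuning.conf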

sysctl -w net.ipv4.tcp_fin_timeout=15

Changes the default timeout value used to determine when a port can be reused.

sysctl -w net.core.somaxconn=1024

sysctl -w net.ipv4.tcp_max_tw_buckets=1440000

sysctl -w net.core.netdev_max_backlog=1024

sysctl -w net.ipv4.tcp_max_syn_backlog=3240000

The above settings all affect the Linux networking connection queue. Consult your distro documentation for the correct and/or recommended values for your kernel and network driver. These are good values to test when dealing with connection-related issues, such as time-outs.
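
A quick, non-intrusive way to tell whether these queues are actually overflowing, before and after tuning, is to check the kernel’s protocol counters (exact counter wording varies by kernel version):

# netstat -s | grep -i listen

If counters such as "times the listen queue of a socket overflowed" or "SYNs to LISTEN sockets dropped" keep climbing under load, the backlog values above are good candidates to raise.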

sysctl -w net.core.rmem_default=8388608

sysctl -w net.core.wmem_default=8388608

sysctl -w net.core.rmem_max=16777216

sysctl -w net.core.wmem_max=16777216

sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"

sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"

sysctl -w net.ipv4.tcp_congestion_control=cubic

These settings all affect TCP buffer sizes and the congestion-control algorithm.

sysctl -w net.ipv4.tcp_syn_retries=2

Controls how many times the kernel retransmits the initial SYN for an outgoing connection before giving up. Note that the lower the value, the more connection failures you may see under packet loss, but the more quickly NGINX will consider an unresponsive upstream unavailable.

Hardware Management

Increase the size of the NIC ring buffers (note: there will be several seconds of connectivity loss after you execute the command):

# ethtool -G eth0 rx 4096 tx 4096

Apply to all NICs in your environment and adjust buffer sizes according to your documentation.
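
Before and after changing ring sizes, you can confirm the current settings and the driver’s supported maximums with ethtool’s query option:

# ethtool -g eth0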

When using multiple NICs in one server, disable irqbalance and run the set_irq_affinity.sh script for each NIC:

# set_irq_affinity.sh eth0

# set_irq_affinity.sh eth1

Scripts are available on GitHub:

https://gist.github.com/SaveTheRbtz/8875474
https://gist.github.com/syuu1228/4352382

Disabling the Aging Timer Rollover in the BIOS – a way to mitigate issues with rx_no_dma_resources – isn’t supported in every BIOS, but where available it’s another area where the hardware can be tuned for performance.

General Sizing and Testing

Sizing

Here’s a very rough sizing approximation for general web server and load-balancing functionality (it may not be as applicable for VOD streaming or CDN use cases):

CPU:

  • 1 cpu core per 1-2 Gb/s of unencrypted traffic.
  • Small (1-2KB) responses and one response per connection will increase CPU load.

RAM:

  • 1GB for OS and other general needs.
  • The rest is divided among NGINX buffers, socket buffers, and the virtual memory cache. A rough estimate is 1MB per connection.
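
As a rough worked example (an estimate only, using the numbers above): an instance expected to hold 10,000 concurrent connections would need on the order of 1GB (OS and general needs) + 10,000 × 1MB ≈ 11GB of RAM.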

Details:

  • proxy_buffers (per connection)
  • proxy_buffers size should be chosen to avoid disk i/o. If response size is larger than (proxy_buffers size + proxy_buffer_size) the response may be written to disk, thus increasing I/O, response time, etc.
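
As an illustration (the buffer counts and sizes here are example values, not recommendations): if a typical upstream response is around 100KB, you would want proxy_buffer_size plus the proxy_buffers pool to comfortably exceed that so responses stay in memory:

proxy_buffer_size 8k;
proxy_buffers 16 8k;    # 16 x 8k = 128k per connection, plus the 8k above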

Sizing shared memory zones:

  • On the surface, NGINX shared memory zones are used to store an upstream group’s data that is shared among worker processes, such as status, metrics, cookies, health checks, etc.
  • A zone can also affect how NGINX distributes load across components such as worker processes, however. For full documentation on what a zone stores and affects, please refer to this section of the load balancing Admin Guide.

There are no exact sizing rules, because usage patterns differ widely. Each feature, such as sticky cookie/route/learn load balancing, health checks, or re-resolving, will affect the zone size. For example, a 256Kb zone with the sticky_route session persistence method and a single health check can hold up to:

  • 128 servers (adding a single peer by specifying IP:port)
  • 88 servers (adding a single peer by specifying hostname:port; hostname resolves to a single IP)
  • 12 servers (adding multiple peers by specifying hostname:port, hostname resolves to many IPs)
  • When creating zones, it’s important to note that the shared memory area is controlled by the name of the zone. If you use the same name for all zones, then all data from all upstreams will be stored in that zone. In this case, the size may be exceeded.
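
For reference, the zone is declared inside the upstream block itself. A minimal sketch with a 256k zone per upstream (names and addresses are illustrative); giving each upstream its own zone name avoids the shared-name pitfall described in the last bullet:

upstream backend_one {
    zone backend_one 256k;
    server 10.0.0.10:8080;
}

upstream backend_two {
    zone backend_two 256k;
    server 10.0.0.20:8080;
}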

Disk I/O:

  • The limiting factor for disk I/O is the number of I/O operations per second (IOPS).
  • NGINX depends on disk I/O and IOPS for a number of functions, including logging and caching.

See notes above for specific settings with regard to logging and caching.
