How Expedia Uses NGINX for Cloud Migration at Scale
This post is adapted from a presentation delivered at nginx.conf 2016 by Dave Drinkle of Expedia, Inc. You can view a recording of the presentation on YouTube.
Dave Drinkle: My name’s Dave Drinkle. I’m a Senior Software Engineer with Expedia. I’ve been with Expedia for about six years, and for the last three years I’ve been working specifically on NGINX configuration for routing traffic through our front door. So what I want to do today is just walk through some of the tips and tricks that we’ve learned.
1:19 Three Pillars of Cloud Migration
Today, I want to talk about the three pillars that we’ve built our cloud migration on.
The first is multi‑region resiliency. This is how we build cross‑regional failover into our NGINX configurations so that if something goes down in one region, we auto‑fail over to another one. We’ll talk about how we do that. With NGINX it’s pretty straightforward to do all of these things, so I just wanted to bring them to light.
The second thing I want to touch on is avoiding the knife edge. At Expedia, we really try to focus on making slow changes. If we’ve put a new app or microservice out there in the cloud, we want to do that in a slow, controlled manner, and we want to have a way to be able to roll that back as quickly as possible if needed, as well.
The last thing I want to talk about is reacting to errors. And this is how can we set up our proxy to react to the errors that are coming back from our microservices or apps.
Before I get into this, I want to give you guys a configuration that’s actually pretty functional. I can’t touch on everything because of time, but before I dive into those three pillars, I want to give us a starting point.
3:00 Traffic Through NGINX Before Cloud Migration
This is basically what Expedia’s traffic routing looked like before we really went to the cloud.
Traffic would come in through the browser, it would hit our CDN and then the traffic would go to our data centers. Pretty straightforward.
The first step we have to do when we’re moving to the cloud is get our NGINX cluster out there and our traffic through it. We want to put NGINX in there as a man in the middle, but we still want to route that traffic back to the data center.
3:23 Traffic Through NGINX After Cloud Migration
This is where we’re going. The CDN is going to break that traffic up, route it into our multiple regions. But instead of the traffic going into our microservices, we’re going to route it all the way back to the data center.
This is where I want to start, and it gets us into our basic configuration.
3:43 Basic Configuration
You’re also going to need your access logs, and your error logs, and all your proxy parameters and all that kind of stuff, but this is the basic configuration for getting your data center set up.
We’ve set it up with two data centers that are weighted 70/30. We’ve got max_fails and fail_timeout set up, and then we’ve added the resolve parameter [on the server directives], which is an NGINX Plus‑only feature – it ensures that the DNS names for our data centers get resolved based on the resolver configuration that we have.
Then the server configuration is pretty straightforward. There’s nothing too complex going on here. That location block there is going to take all of the traffic that doesn’t have another route defined, and we’re going to route it to our data center with that proxy_pass line.
A good practice is to always set proxy_set_header for the Host and X-Forwarded-For headers.
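As a rough sketch – the hostnames, resolver address, zone size, and failure thresholds here are placeholders rather than Expedia’s actual values, while the 70/30 weighting comes from the description above – that base configuration might look something like this:

    # Resolver used by the "resolve" parameter on the upstream servers (address is a placeholder)
    resolver 10.0.0.2 valid=30s;

    upstream datacenter {
        zone datacenter 64k;    # shared-memory zone, required for the resolve parameter (size is a guess)
        # Two data centers weighted 70/30; hostnames and failure thresholds are placeholders
        server dc1.example.internal weight=7 max_fails=3 fail_timeout=30s resolve;
        server dc2.example.internal weight=3 max_fails=3 fail_timeout=30s resolve;
    }

    server {
        listen 80;

        # Catch-all: any traffic without a more specific route goes back to the data center
        location / {
            proxy_set_header Host            $host;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_pass http://datacenter;
        }
    }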
So this isn’t too complex, it’s just a base configuration that we’re going to build on as we go.
4:58 Multi-Region Resiliency
Why do we need multi-region resiliency?
At Expedia, we focus on two main things. First is fault tolerance. We want to be sure that our customers are always getting a response if at all possible. That may mean we need to route traffic from one region to another.
We also want to reduce latency. If your microservice can be built so that it doesn’t have to phone home to the data center, then you can reduce latency by deploying your microservices to as many regions as you can. And that means putting your NGINX in place there.
For the actual resiliency piece, we’ll use NGINX Plus health checks for the auto‑failover.
5:45 Multi-Region Resiliency In Pictures
Here’s a diagram to illustrate that visually. You can see all our traffic is coming in from our CDN. It’s going into our regional NGINX clusters. [Traffic coming into] the NGINX cluster in Region 1 is getting routed to the app in Region 1, and the NGINX clusters in Region 2 are routing it to the app in Region 2.
But what happens if we have a problem with that app?
6:20 Auto-Failover
What we want to get to is this diagram. If the app in Region 1 fails, the NGINX cluster in Region 1 is going to stop sending traffic to the app in Region 1 and route it over to Region 2.
Now, there’s some networking layer stuff here we have to make sure we have in place. You’ve got to make sure that you can actually talk from Region 1 to Region 2.
6:50 Routing to Your App
The configuration we will show as we go along will be the app configuration, and it’s very similar to what I just showed you with the data center.
We have our [upstream block called] app_upstream. We have one primary server, the top one there, app.region1. It’s set up with the max_fails, fail_timeout, and resolve directives. That’s going to take all of our traffic.

Then the second server is just a backup. It’s set up to go over to our Region 2. The key there is, if we get hard failures with 500s or that kind of thing from the Region 1 server, then NGINX will fail over. That’s what the max_fails is all about.
And then we’re also going to set up our health checks at the bottom of this config. The next section there is the match block; that tells NGINX what we consider to be a valid response from a health check. We’re not really too concerned about what’s in the response. We’re just saying that if it’s got a status between 200 and 399, we consider it a valid response for the health check.
In the configuration for our actual application, we’re going to say that if you make a request on /myapp, that’s going to go to our application in the cloud. That’s what this proxy_pass line is all about.
You’ll notice here that I’ve actually split out the health check. There’s a couple reasons I’ve done this. One, in the next couple slides, we’re going to mess with this upstream, so I don’t want to have my health checks be tied directly to this particular location block.
The other reason to do this is when you have multiple URL paths that you want to route to a single microservice. This way we can have multiple paths and multiple location blocks within our configuration, and we don’t have to have the health‑check configuration multiple times within there.
So this works well. We just set up a health‑check location for whatever app we’re health checking and configure it that way.
The reason this works is that when NGINX does health checks, it checks the upstream itself. So if you have two location blocks that are both using the same upstream, [then] as long as the health check fails on one of them, it will fail on everything related to that upstream. That’s why we can break it out and it will still work fine for everything.
The health‑check logic is pretty straightforward. We’re saying that we want to match on the criteria from is_working. If it fails twice, we’re going to consider it bad. It’s going to check every 15 seconds, so it has to fail twice within 30 seconds. We want to slow that down for positives, so we have our passes take a little bit longer, and then we have the uri.
There’s lots of other configuration you can do within the health check itself, but for now (for just the simplicity of the configuration here), I’m just showing you a bare‑bones config. So, this will get you going for sure, and you can kind of go from there.
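Pulling those pieces together, a bare‑bones version might look like the sketch below. The fails=2 and 15‑second interval come from the description above; the hostnames, the passes value, the health‑check URI, and the /healthcheck/myapp location name are assumptions for illustration.

    upstream app_upstream {
        zone app_upstream 64k;    # shared-memory zone, required for resolve and active health checks
        # Primary: the app in this region takes all the traffic (hostname is a placeholder)
        server app.region1.example.internal max_fails=3 fail_timeout=30s resolve;
        # Backup: only used when the primary is marked unavailable
        server app.region2.example.internal backup resolve;
    }

    # What counts as a healthy response: any status from 200 through 399
    match is_working {
        status 200-399;
    }

    # Inside the same server block as the base configuration:

    # Application traffic for this microservice
    location /myapp {
        proxy_set_header Host            $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass http://app_upstream;
    }

    # Health check split out into its own location so several application
    # locations can share one upstream and one health-check definition
    location /healthcheck/myapp {
        proxy_pass http://app_upstream;
        health_check match=is_working fails=2 passes=5 interval=15 uri=/health;
    }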
That’s our bare‑bones config for multi‑region resilience. It’s very simple, but it really does give us an auto‑failover from one region to the other.
Let’s talk about the knife edge. One of the things about this configuration is that if we just put this in here right now, every request for my app would immediately start going to our application’s upstream. It wouldn’t be split between the data centers. Let’s talk about how to deal with that.
10:32 Removing the Knife Edge
Why do we want to remove the knife edge? We want traffic to be moved from one origin to another in a controlled manner. When you’re working at scale, if we can send just 10% of that traffic to a new microservice and then start slowly, methodically ramping that up to 100%, we’re going to be much better off.
If you’re taking the kind of traffic that Expedia takes, you don’t want to break everything, even if it’s just for a few seconds. You really want to try and do this as slowly as possible without being too slow.
Obviously, this method is only useful for URL patterns that are currently taking traffic. If you’ve got a brand‑new URL pattern that’s going to go to a brand‑new microservice, you would not use this configuration.
What we’re going to do is use two modules that come with NGINX called the User ID plug‑in [module] and the Split clients plug‑in, and we’re going to make them work together.
Here is a diagram of what we’re going to do. Traffic is going to come in from the CDN. It’s going to hit that NGINX cluster and the NGINX cluster’s going to split that however we need to, between your app in the cloud and your data center.
11:51 Setting Up the Cookie
Let’s talk about the cookie first. It’s a pretty basic configuration.
We turn on the userid cookie. We set what the name of the cookie is going to be. We set the path. We set the expiry pretty high at 365 days because we don’t need it to change very often.
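Assuming the cookie is called ourbucket (the name the split_clients examples below hash on), that setup is just a few lines:

    # Issue a (mostly) unique per-browser ID that we can hash on later
    userid         on;
    userid_name    ourbucket;    # exposed to the config as $cookie_ourbucket
    userid_path    /;
    userid_expires 365d;         # long-lived so a user's bucket rarely changes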
There’s one big caveat with this, and that is that the userid cookie is going to generate an ID, but it’s not guaranteed unique. It’s not a GUID. It’s close to a GUID and it’ll be fairly unique, but you will get duplicates.

If it’s really important for you to have a perfect split and for everybody to get a unique cookie, you can’t use the User ID plug‑in. You’d have to come up with some other mechanism to generate a cookie before the request actually hits the proxy, or before NGINX starts processing it. This could be done with Lua, but that’s beyond the scope of this presentation.
13:00 Split_Clients Config
The split_clients configuration is also really simple. split_clients works kind of like a map – if you’ve ever used maps within NGINX – where we’re going to inspect $cookie_ourbucket and put that value through a hashing algorithm. The algorithm is MurmurHash2, which takes that key or that value and generates a big number, and from that number generates a percentage.

If you fall into a certain percentage – say we get 9% – we would go to our app_upstream upstream group. If we’re anything above 10%, we’re going to go to our data center. What happens then is the value app_upstream or the value datacenter gets applied to the variable $app_throttle.
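As a sketch, a 10/90 split between the new app and the data center would look roughly like this (the percentage is simply whatever you happen to be ramping to at the time):

    # Hash the ourbucket cookie value (MurmurHash2) and bucket users by percentage
    split_clients "$cookie_ourbucket" $app_throttle {
        10%     app_upstream;    # this slice of users goes to the cloud app
        *       datacenter;      # everyone else stays on the data center
    }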
There are three things I want to mention about this. You cannot use variables within the value of your split_clients. You can use upstreams, but you just can’t use variables. The variable will get interpreted as a string, and you’ll end up with a string containing your variable name.

Also, zero is not a valid option. So, you cannot put a zero as a percentage. If you do have to take your percentage back down to 0%, you should either comment the line out or just remove the line.

The other issue is with $cookie_ourbucket. The very first time a user hits your proxy, that’s when the User ID plug‑in generates the cookie – which is perfect, it works great, except that when the cookie is generated on that request, its value does not get applied to the variable $cookie_ourbucket.

So, the very first time a customer comes into your proxy without the cookie and it’s generated, this variable will be empty. Then your hashing algorithm hashes every one of those people exactly the same way, into the exact same percentage bucket, and you end up with a pretty bad split.
15:10 Modified Split_Clients Config
So, what we came up with was a slight modification to this configuration. It uses two variables that are available from the User ID plug‑in. It’s a very small change: instead of inspecting $cookie_ourbucket, we’re going to inspect two variables, $uid_set and $uid_got.

These two variables are mutually exclusive: only one will ever be set at a time. $uid_set will be set when the User ID plug‑in sets the cookie, and it’ll be blank when the cookie has been sent in from the browser. $uid_got will hold the value when the request has come in from the browser with the cookie, and it’ll be blank whenever the User ID plug‑in is setting it.
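The modified version is the same split with a different key – the two mutually exclusive variables concatenated, so even a first‑time visitor hashes on the freshly generated ID:

    # Exactly one of $uid_set / $uid_got is non-empty on any given request,
    # so their concatenation always carries the user's identifier
    split_clients "${uid_set}${uid_got}" $app_throttle {
        10%     app_upstream;
        *       datacenter;
    }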
Effectively, what we’ve done here is we’ve used two variables. We know that one of them’s always going to be blank, so we end up getting the same result from both of them. This way, even for your very first request that comes into your proxy, you’re still going to get a nice split.
16:28 Traffic Routing with Split_Client Values
The last bit of this configuration is really simple and is just setting the upstream in the /myapp location block. All we’re going to do is, instead of using app_upstream like before, we’re going to use $app_throttle, which we set previously with that split_clients config.
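Which makes the location block look roughly like this:

    location /myapp {
        proxy_set_header Host            $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        # $app_throttle evaluates to either "app_upstream" or "datacenter",
        # depending on which split_clients bucket this user's cookie hashes into
        proxy_pass http://$app_throttle;
    }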
Really straightforward, and what we end up getting is a nice split that we can control with our code just by switching the percentages.
What we like about this kind of configuration is how easy it is to roll back. You may have a team that ramps up traffic to a new app and it looks good. Everybody’s happy, everybody goes home, and then over the next couple of days you’re actually making more configuration changes for other different microservices teams or whatever. Then the original app team comes back and says, “Hey, we need to roll back. We have a problem that we didn’t recognize on deployment night.”
Now, instead of having to actually roll back the code, you can roll back by just taking your app_upstream percentage down to zero (by commenting out or removing its line, since 0% isn’t a valid value), and all of a sudden you’re back to the data center for all the traffic.
18:07 Reacting to Application Errors
The last thing I want to talk about is reacting to application errors. I’m going to talk about two different types of errors: hard errors and soft errors.
Hard errors are pretty obvious: these are your 400‑ and 500‑level errors that are clearly application misconfigurations or failures. Soft errors, on the other hand, are those application errors that require the request to be reprocessed.

In HTTP land, you’ll generally handle soft errors with a redirect – a 301, 302, 307, something like that. I’m also going to show you some other ways we can do that which are actually pretty interesting.
18:54 Reacting to Hard Errors
Let’s talk about hard errors first. Why do we want our proxy to look after hard errors?
The first reason is: we get a nice, unified error page. It’s important that customers always get a nice error page. If you’ve got a lot of microservices out there in the cloud, you don’t want every one of them to carry its own code for displaying errors nicely. That gets even more complicated when you’re dealing with multilingual sites: instead of one very simple error page, you have to have 20 different error pages because you have 20 different languages to support.
A couple of things are really nice about this. Our apps can go back to a process of just responding with a 400 or a 500 error code when it truly is a 500 error. We can get it logged properly in the application. We can get it logged properly within our proxy, and still make sure that we’re sending a nice error page back to the customer.

The other thing is, with 500 errors, because we know that the proxy is going to be responding with a nice error page, we can let our app teams send stack traces out on those errors, since we know those are not going to propagate out to the customer.
The last benefit is that application developers just don’t have to worry about errors at all. They get to do the things they’re paid to do instead of making sure error pages look right.
So that’s why we handle hard errors at the proxy level.
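As an illustration of the idea (not Expedia’s actual configuration – the error‑page paths here are placeholders), intercepting hard errors at the proxy can be as simple as:

    # Hand any error status coming back from an upstream to NGINX's own error handling
    proxy_intercept_errors on;

    # One unified, customer-friendly page for hard failures
    error_page 500 502 503 504 /errors/500.html;
    error_page 404             /errors/404.html;

    location /errors/ {
        internal;                      # reachable only via error_page, never directly
        root /usr/share/nginx/html;    # wherever the static error pages live (placeholder path)
    }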
20:54 Reacting to Soft Errors
In terms of soft errors, I’m talking about requests that the app may not be able to handle. You don’t want those requests to go to the app, so maybe you add a whole bunch of lines to your NGINX config to keep them away. For instance, if a particular query string is matched, you don’t send the request to the cloud; you send it back to your data center, because you know your data center can handle it.
That may work in the short term, but it creates this tight coupling between NGINX and your app or microservice, and we all know we should probably be trying to architect for fairly decoupled systems.
An example we had at Expedia was that we needed to move our hotel‑search page from an old URL pattern, with its old set of query strings, to a new URL pattern with a completely different set of query strings.
What we ended up doing was building a microservice that could do that translation. But hotel searches are actually quite complicated for various reasons, so there were certain features of the hotel search that we couldn’t easily translate. So what we did was just say, “All of the traffic for the old pattern is going to go to this microservice, and when the microservice itself can’t produce the response, we’ll have it send an error back to the customer, and we’ll do that with a 302 or a 307.” That’s how we originally implemented this. It was handled with the redirect.
So normally what you do is add a query‑string parameter like nocloud or noapp or similar, and then key off of that within your NGINX config.
When NGINX sees that query‑string parameter, you can just short‑circuit all of your routing and route it all the way back to the data center. So, that’s one approach.
Another reason to use soft errors is that maybe as you’re migrating your applications to the cloud, you want to get things up quick. You might want to build 80% of the features for now, and the last 20% you’re going to build over time. This is another reason you could use this soft‑error approach.
This is what it would look like if we handle soft errors with a 302.

Your request would come in for /myapp. The NGINX proxy is going to route it to your cloud app. The cloud app, for some reason, can’t handle the request, so it responds with a 302, and the location in that 302 header is the same request but with a question mark and noapp=1 appended.

That goes all the way back to the browser, the browser re‑requests the new page, and the NGINX proxy grabs it and says, “Oh, that’s got a noapp=1 on it, so I’m going to route it to the data center.”
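A sketch of how the proxy side of that might be keyed off the query string – the noapp parameter name comes from the example above, while the $soft_error_upstream variable name is just for illustration:

    # Pick a destination based on whether the request carries ?noapp=1
    map $arg_noapp $soft_error_upstream {
        ""        app_upstream;    # no parameter: try the cloud app
        default   datacenter;      # noapp present: short-circuit back to the data center
    }

    # In the server block:
    location /myapp {
        proxy_pass http://$soft_error_upstream;
    }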
That’s one approach to it, but there’s actually a better one.
24:11 Soft Errors – A Better Approach
We can react to soft errors within the proxy itself.
What happens is the browser is still going to make that initial request, which will go to the app, but instead of responding with a 302, the app will respond with a special error code. At Expedia, we use a nonstandard HTTP error code, which I’ll show you later.

Then the proxy is going to use its error‑handling system to repeat the exact same request, but route it to the data center. The customer gets the same result, but we skip the step of the browser having to do all of the 302 and 307 work. If you’re dealing with CDNs, especially if you’re serving customers across the world, you can reduce latency significantly this way, because the requests don’t have to go all the way back to the browser.
This is what the configuration for that approach looks like.
What we do is use NGINX’s error handling. We set an error_page for a 352, which is our nonstandard error code. When we get a 352, we’re going to send it to our location named @352retry. And then on a 404, we’re going to send it to our @404fallback location.

We have proxy_intercept_errors turned on, which is crucial. Without proxy_intercept_errors, none of this works – all of this error handling just doesn’t happen.
The actual location blocks are pretty straightforward as well. The reason I set $original_error_code is to help log the original error. When NGINX makes the secondary request for your error handler, the response that comes back from the error handler is what actually gets logged. So, if you don’t log your original error code, you’re going to lose what was going on with the original request.

There are lots of other things you could capture here as well. If there are some headers or something else that you wanted to grab, you can log those too.

The proxy_pass line is pretty straightforward. We’re going to route to our datacenter upstream, and then we’re going to use the $uri variable, which is just the original request URI. We’re going to use $is_args, which gives us a question mark if there were arguments and is blank if there are not. And then we’ll use $args for the actual arguments.
The @404fallback location block works very similarly. We’re going to log the 404, and then we’re going to route to the 404ErrorPage.
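Put together, that error‑handling setup looks roughly like the sketch below. The 352 code, the named locations, and the $uri$is_args$args replay come straight from the description above; the error‑page upstream name is a placeholder.

    location /myapp {
        proxy_intercept_errors on;           # without this, none of the error_page handling fires
        error_page 352 = @352retry;          # nonstandard "re-run this at the data center" code
        error_page 404 = @404fallback;
        proxy_pass http://app_upstream;
    }

    # Replay the exact same request, but against the data center
    location @352retry {
        set $original_error_code 352;        # keep the original status around so it can be logged
        proxy_set_header Host            $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass http://datacenter$uri$is_args$args;
    }

    # Serve the unified 404 page instead of whatever the app produced
    location @404fallback {
        set $original_error_code 404;
        rewrite ^ /404ErrorPage.html break;  # request the shared 404 page instead
        proxy_pass http://error_pages;       # upstream serving the error pages (name is a placeholder)
    }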
27:00 Summary of Lessons from Expedia’s Cloud Migration
So that wraps up the three big steps, the three pillars we have for moving to the cloud.
We really want that multi‑region resiliency to make sure that when something goes wrong with an app, we auto‑route that traffic to a different region, so the customer gets a higher‑latency response rather than a bad one.
The split_clients config that we did to avoid that knife edge is so important when you’re dealing with microservices. Especially in situations where you have significant amounts of existing traffic, it’s so important that you don’t just flip all that traffic over to a new service at once. We’ve seen it countless times: if you just flip it over, you’re going to break something, and you just want to avoid that.
And then this whole idea of soft and hard errors really is just using the NGINX config to its potential.
See for yourself how NGINX Plus can smooth your cloud migration – start your free 30‑day trial today or contact us for a live demo.