10

Implementing Resilient Applications

Vision

Your microservice-based and cloud-based applications must embrace the partial failures that will certainly occur eventually. You need to design your application to be resilient to those partial failures.

Resiliency is the ability to recover from failures and continue to function. It is not about avoiding failures, but accepting the fact that failures will happen and responding to them in a way that avoids downtime or data loss. The goal of resiliency is to return the application to a fully functioning state after a failure.

It is challenging enough to design and deploy a microservices-based application. But you also need to keep your application running in an environment where some sort of failure is certain. Therefore, your application should be resilient. It should be designed to cope with partial failures, like network outages or nodes or VMs crashing in the cloud. Even microservices (containers) being moved to a different node within a cluster can cause intermittent short failures within the application.

The many individual components of your application should also incorporate health monitoring features. By following the guidelines in this chapter, you can create an application that can work smoothly in spite of transient downtime or the normal hiccups that occur in complex and cloud-based deployments.

Handling partial failure

In distributed systems like microservices-based applications, there is an ever-present risk of partial failure. For instance, a single microservice/container can fail or might not be available to respond for a short time, or a single VM or server can crash. Since clients and services are separate processes, a service might not be able to respond in a timely way to a client’s request. The service might be overloaded and responding extremely slowly to requests, or might simply not be accessible for a short time because of network issues.

For example, consider the Order details page from the eShopOnContainers sample application. If the ordering microservice is unresponsive when the user tries to submit an order, a bad implementation of the client process (the MVC web application)—for example, if the client code were to use synchronous RPCs with no timeout—would block threads indefinitely waiting for a response. In addition to creating a bad user experience, every unresponsive wait consumes or blocks a thread, and threads are extremely valuable in highly scalable applications. If there are many blocked threads, eventually the application’s runtime can run out of threads. In that case, the application can become globally unresponsive instead of just partially unresponsive, as shown in Figure 10-1.

image

Figure 10-1. Partial failures because of dependencies that impact service thread availability
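To make the contrast concrete, the following is a minimal, hypothetical sketch (not taken from eShopOnContainers); the OrderDetails type, class names, and URI are placeholders, and JsonConvert comes from Newtonsoft.Json as in the other snippets in this chapter. The first method blocks a thread with a synchronous wait and no timeout, while the second stays asynchronous and bounds the wait.

using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using Newtonsoft.Json;

public class OrderDetails { /* placeholder type for the example */ }

public class OrderQueries
{
    // Anti-pattern: a synchronous wait with no timeout can block the calling
    // thread indefinitely if the ordering service never responds.
    public OrderDetails GetOrderBlocking(HttpClient client, string uri)
    {
        var json = client.GetStringAsync(uri).Result; // blocks the thread
        return JsonConvert.DeserializeObject<OrderDetails>(json);
    }

    // Better: asynchronous and bounded by a timeout, so the thread is released
    // while waiting and the call fails fast if the service is unresponsive.
    public async Task<OrderDetails> GetOrderAsync(HttpClient client, string uri)
    {
        using (var cts = new CancellationTokenSource(TimeSpan.FromSeconds(5)))
        {
            var response = await client.GetAsync(uri, cts.Token);
            response.EnsureSuccessStatusCode();
            var json = await response.Content.ReadAsStringAsync();
            return JsonConvert.DeserializeObject<OrderDetails>(json);
        }
    }
}

The asynchronous version still fails if the service is down, but it fails quickly and releases the thread, which is what the resiliency techniques in the rest of this chapter build on.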

In a large microservices-based application, any partial failure can be amplified, especially if most of the internal microservices interaction is based on synchronous HTTP calls (which is considered an anti-pattern). Think about a system that receives millions of incoming calls per day. If your system has a bad design that is based on long chains of synchronous HTTP calls, these incoming calls might result in many more millions of outgoing calls (let’s suppose a ratio of 1:4) to dozens of internal microservices as synchronous dependencies. This situation is shown in Figure 10-2, especially dependency #3.

image

Figure 10-2. The impact of having an incorrect design featuring long chains of HTTP requests

Intermittent failure is virtually guaranteed in a distributed and cloud-based system, even if every dependency itself has excellent availability. This is a fact you need to accept and plan for.

If you do not design and implement techniques to ensure fault tolerance, even small downtimes can be amplified. As an example, 50 dependencies each with 99.99% availability would result in several hours of downtime each month because of this ripple effect. When a microservice dependency fails while handling a high volume of requests, that failure can quickly saturate all available request threads in each service and crash the whole application.
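As a rough check on that figure, assuming the 50 dependencies fail independently:

0.9999^50 ≈ 0.995, which is about 0.5% combined unavailability
0.005 * 730 hours in a month ≈ 3.6 hours of downtime per month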

image

Figure 10-3. Partial failure amplified by microservices with long chains of synchronous HTTP calls

To minimize this problem, in the section “Asynchronous microservice integration enforces microservice’s autonomy” (in the architecture chapter), we encouraged you to use asynchronous communication across the internal microservices. We briefly explain more in the next section.

In addition, it is essential that you design your microservices and client applications to handle partial failures—that is, to build resilient microservices and client applications.

Strategies for handling partial failure

Strategies for dealing with partial failures include the following.

Use asynchronous communication (for example, message-based communication) across internal microservices. It is highly advisable not to create long chains of synchronous HTTP calls across the internal microservices, because that incorrect design will eventually become the main cause of bad outages. Instead, except for the front-end communication between the client applications and the first level of microservices or fine-grained API Gateways, use only asynchronous (message-based) communication across the internal microservices once past the initial request/response cycle. Eventual consistency and event-driven architectures will help to minimize ripple effects. These approaches enforce a higher level of microservice autonomy and therefore help prevent the problem noted here.

Use retries with exponential backoff. This technique helps to cope with short and intermittent failures by retrying a call a certain number of times, in case the service was unavailable only for a short time. This might occur due to intermittent network issues or when a microservice/container is moved to a different node in a cluster. However, if these retries are not designed properly with circuit breakers, they can aggravate the ripple effects, ultimately even causing a Denial of Service (DoS).

Work around network timeouts. In general, clients should be designed not to block indefinitely and to always use timeouts when waiting for a response. Using timeouts ensures that resources are never tied up indefinitely.

Use the Circuit Breaker pattern. In this approach, the client process tracks the number of failed requests. If the error rate exceeds a configured limit, a “circuit breaker” trips so that further attempts fail immediately. (If a large number of requests are failing, that suggests the service is unavailable and that sending requests is pointless.) After a timeout period, the client should try again and, if the new requests are successful, close the circuit breaker.

Provide fallbacks. In this approach, the client process performs fallback logic when a request fails, such as returning cached data or a default value. This is an approach suitable for queries, and is more complex for updates or commands.

Limit the number of queued requests. Clients should also impose an upper bound on the number of outstanding requests that a client microservice can send to a particular service. If the limit has been reached, it is probably pointless to make additional requests, and those attempts should fail immediately. In terms of implementation, the Polly Bulkhead Isolation policy can be used to fulfill this requirement. This approach is essentially a parallelization throttle with SemaphoreSlim as the implementation. It also permits a “queue” outside the bulkhead. You can proactively shed excess load even before execution (for example, because capacity is deemed full). This makes its response to certain failure scenarios faster than a circuit breaker would be, since the circuit breaker waits for the failures. The BulkheadPolicy object in Polly exposes how full the bulkhead and queue are, and offers events on overflow, so it can also be used to drive automated horizontal scaling. A sketch of this approach follows this list.
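The following is a minimal sketch (not code from eShopOnContainers) of how a Polly Bulkhead Isolation policy could be combined with a Timeout policy around an HTTP call; the class name, limits, and URI are illustrative assumptions.

using System;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;
using Polly.Timeout;

public class CatalogQueryClient
{
    private static readonly HttpClient _client = new HttpClient();

    // Reject any single call that takes longer than 10 seconds. The pessimistic
    // strategy does not require the delegate to observe a CancellationToken.
    private static readonly Policy _timeout =
        Policy.TimeoutAsync(TimeSpan.FromSeconds(10), TimeoutStrategy.Pessimistic);

    // Allow at most 20 concurrent executions plus 10 queued requests; further
    // calls are rejected immediately instead of piling up on blocked threads.
    private static readonly Policy _bulkhead = Policy.BulkheadAsync(20, 10);

    public Task<string> GetCatalogAsync(string uri)
    {
        // The bulkhead (outermost policy) gates admission; the timeout bounds
        // each admitted call.
        var policies = Policy.WrapAsync(_bulkhead, _timeout);
        return policies.ExecuteAsync(() => _client.GetStringAsync(uri));
    }
}

Calls rejected by the bulkhead throw a BulkheadRejectedException, which the caller can treat like any other fast failure.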

Additional resources

Implementing retries with exponential backoff

Retries with exponential backoff is a technique that attempts to retry an operation, with an exponentially increasing wait time, until a maximum retry count has been reached (the exponential backoff). This technique embraces the fact that cloud resources might intermittently be unavailable for more than a few seconds for any reason. For example, an orchestrator might be moving a container to another node in a cluster for load balancing. During that time, some requests might fail. Another example could be a database like SQL Azure, where a database can be moved to another server for load balancing, causing the database to be unavailable for a few seconds.

There are many approaches to implementing retry logic with exponential backoff.

Implementing resilient Entity Framework Core SQL connections

For Azure SQL DB, Entity Framework Core already provides internal database connection resiliency and retry logic. But you need to enable the Entity Framework execution strategy for each DbContext connection if you want to have resilient EF Core connections.

For instance, the following code at the EF Core connection level enables resilient SQL connections that are retried if the connection fails.

// Startup.cs from any ASP.NET Core Web API
public class Startup
{
    // Other code ...

    public IServiceProvider ConfigureServices(IServiceCollection services)
    {
        // ...
        services.AddDbContext<OrderingContext>(options =>
        {
            options.UseSqlServer(Configuration["ConnectionString"],
                sqlServerOptionsAction: sqlOptions =>
                {
                    sqlOptions.EnableRetryOnFailure(
                        maxRetryCount: 5,
                        maxRetryDelay: TimeSpan.FromSeconds(30),
                        errorNumbersToAdd: null);
                });
        });
    }
    // ...
}

Execution strategies and explicit transactions using BeginTransaction and multiple DbContexts

When retries are enabled in EF Core connections, each operation you perform using EF Core becomes its own retriable operation. Each query and each call to SaveChanges will be retried as a unit if a transient failure occurs.

However, if your code initiates a transaction using BeginTransaction, you are defining your own group of operations that need to be treated as a unit—everything inside the transaction has to be rolled back if a failure occurs. You will see an exception like the following if you attempt to execute that transaction when using an EF execution strategy (retry policy) and you include several SaveChanges calls from multiple DbContexts in the transaction.

System.InvalidOperationException: The configured execution strategy ‘SqlServerRetryingExecutionStrategy’ does not support user initiated transactions. Use the execution strategy returned by ‘DbContext.Database.CreateExecutionStrategy()’ to execute all the operations in the transaction as a retriable unit.

The solution is to manually invoke the EF execution strategy with a delegate representing everything that needs to be executed. If a transient failure occurs, the execution strategy will invoke the delegate again. For example, the following code shows how it is implemented in eShopOnContainers with two DbContexts (_catalogContext and the IntegrationEventLogContext) when updating a product and then saving the ProductPriceChangedIntegrationEvent object, which needs to use a different DbContext.

public async Task<IActionResult> UpdateProduct([FromBody]CatalogItem productToUpdate)
{
    // Other code ...

    // Update current product
    catalogItem = productToUpdate;

    // Use of an EF Core resiliency strategy when using multiple DbContexts
    // within an explicit transaction
    // See:
    // https://docs.microsoft.com/en-us/ef/core/miscellaneous/connection-resiliency
    var strategy = _catalogContext.Database.CreateExecutionStrategy();
    await strategy.ExecuteAsync(async () =>
    {
        // Achieving atomicity between the original Catalog database operation and the
        // IntegrationEventLog thanks to a local transaction
        using (var transaction = _catalogContext.Database.BeginTransaction())
        {
            _catalogContext.CatalogItems.Update(catalogItem);
            await _catalogContext.SaveChangesAsync();

            // Save to EventLog only if product price changed
            if (raiseProductPriceChangedEvent)
                await _integrationEventLogService.SaveEventAsync(priceChangedEvent);

            transaction.Commit();
        }
    });
    // Other code ...
}

The first DbContext is _catalogContext and the second DbContext is within the _integrationEventLogService object. The Commit action is performed across multiple DbContexts using an EF execution strategy.

Additional resources

Implementing custom HTTP call retries with exponential backoff

In order to create resilient microservices, you need to handle possible HTTP failure scenarios. For that purpose, you could create your own implementation of retries with exponential backoff.

In addition to handling temporary resource unavailability, the exponential backoff also needs to take into account that the cloud provider might throttle availability of resources to prevent usage overload. For example, creating too many connection requests very quickly might be viewed as a Denial of Service (DoS) attack by the cloud provider. As a result, you need to provide a mechanism to scale back connection requests when a capacity threshold has been encountered.

As an initial exploration, you could implement your own code with a utility class for exponential backoff as in RetryWithExponentialBackoff.cs, plus code like the following (which is also available on a GitHub repo).

public sealed class RetryWithExponentialBackoff
{
    private readonly int maxRetries, delayMilliseconds, maxDelayMilliseconds;

    public RetryWithExponentialBackoff(int maxRetries = 50,
        int delayMilliseconds = 200,
        int maxDelayMilliseconds = 2000)
    {
        this.maxRetries = maxRetries;
        this.delayMilliseconds = delayMilliseconds;
        this.maxDelayMilliseconds = maxDelayMilliseconds;
    }

    public async Task RunAsync(Func<Task> func)
    {
        ExponentialBackoff backoff = new ExponentialBackoff(this.maxRetries,
            this.delayMilliseconds,
            this.maxDelayMilliseconds);
    retry:
        try
        {
            await func();
        }
        catch (Exception ex) when (ex is TimeoutException ||
            ex is System.Net.Http.HttpRequestException)
        {
            Debug.WriteLine("Exception raised is: " +
                ex.GetType().ToString() +
                " - Message: " + ex.Message +
                " - Inner Message: " +
                ex.InnerException.Message);
            await backoff.Delay();
            goto retry;
        }
    }
}

public struct ExponentialBackoff
{
    private readonly int m_maxRetries, m_delayMilliseconds, m_maxDelayMilliseconds;
    private int m_retries, m_pow;

    public ExponentialBackoff(int maxRetries, int delayMilliseconds,
        int maxDelayMilliseconds)
    {
        m_maxRetries = maxRetries;
        m_delayMilliseconds = delayMilliseconds;
        m_maxDelayMilliseconds = maxDelayMilliseconds;
        m_retries = 0;
        m_pow = 1;
    }

    public Task Delay()
    {
        if (m_retries == m_maxRetries)
        {
            throw new TimeoutException("Max retry attempts exceeded.");
        }
        ++m_retries;
        if (m_retries < 31)
        {
            m_pow = m_pow << 1; // m_pow = Pow(2, m_retries - 1)
        }
        int delay = Math.Min(m_delayMilliseconds * (m_pow - 1) / 2,
            m_maxDelayMilliseconds);
        return Task.Delay(delay);
    }
}
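With the default constructor values shown above (maxRetries = 50, delayMilliseconds = 200, maxDelayMilliseconds = 2000), the waits computed by the Delay method grow as follows:

retry 1: min(200 * (2 - 1) / 2, 2000) = 100 ms
retry 2: min(200 * (4 - 1) / 2, 2000) = 300 ms
retry 3: min(200 * (8 - 1) / 2, 2000) = 700 ms
retry 4: min(200 * (16 - 1) / 2, 2000) = 1500 ms
retry 5 and later: capped at the 2000 ms maximum delay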

Using this code in a client C# application (another Web API client microservice, an ASP.NET MVC application, or even a C# Xamarin application) is straightforward. The following example shows how, using the HttpClient class.

public async Task<Catalog> GetCatalogItems(int page, int take, int? brand, int? type)
{
    _apiClient = new HttpClient();
    var itemsQs = $"items?pageIndex={page}&pageSize={take}";
    var filterQs = "";
    var catalogUrl =
        $"{_remoteServiceBaseUrl}items{filterQs}?pageIndex={page}&pageSize={take}";
    var dataString = "";

    //
    // Using HttpClient with Retry and Exponential Backoff
    //
    var retry = new RetryWithExponentialBackoff();
    await retry.RunAsync(async () =>
    {
        // work with HttpClient call
        dataString = await _apiClient.GetStringAsync(catalogUrl);
    });
    return JsonConvert.DeserializeObject<Catalog>(dataString);
}

However, this code is suitable only as a proof of concept. The next section explains how to use more sophisticated and proven libraries.

Implementing HTTP call retries with exponential backoff with Polly

The recommended approach for retries with exponential backoff is to take advantage of more advanced .NET libraries like the open source Polly library.

Polly is a .NET library that provides resilience and transient-fault handling capabilities. You can implement those capabilities easily by applying Polly policies such as Retry, Circuit Breaker, Bulkhead Isolation, Timeout, and Fallback. Polly targets .NET 4.x and the .NET Standard Library 1.0 (which supports .NET Core).

The Retry policy in Polly is the approach used in eShopOnContainers when implementing HTTP retries. You can implement an interface so you can inject either standard HttpClient functionality or a resilient version of HttpClient using Polly, depending on what retry policy configuration you want to use.

The following example shows the interface implemented in eShopOnContainers.

public interface IHttpClient
{
    Task<string> GetStringAsync(string uri, string authorizationToken = null,
        string authorizationMethod = "Bearer");

    Task<HttpResponseMessage> PostAsync<T>(string uri, T item,
        string authorizationToken = null, string requestId = null,
        string authorizationMethod = "Bearer");

    Task<HttpResponseMessage> DeleteAsync(string uri,
        string authorizationToken = null, string requestId = null,
        string authorizationMethod = "Bearer");

    // Other methods ...
}

You can use the standard implementation if you do not want to use a resilient mechanism, as when you are developing or testing simpler approaches. The following code shows the standard HttpClient implementation, which allows requests to optionally use authentication tokens.

public class StandardHttpClient : IHttpClient
{
    private HttpClient _client;
    private ILogger<StandardHttpClient> _logger;

    public StandardHttpClient(ILogger<StandardHttpClient> logger)
    {
        _client = new HttpClient();
        _logger = logger;
    }

    public async Task<string> GetStringAsync(string uri,
        string authorizationToken = null,
        string authorizationMethod = "Bearer")
    {
        var requestMessage = new HttpRequestMessage(HttpMethod.Get, uri);
        if (authorizationToken != null)
        {
            requestMessage.Headers.Authorization =
                new AuthenticationHeaderValue(authorizationMethod, authorizationToken);
        }
        var response = await _client.SendAsync(requestMessage);
        return await response.Content.ReadAsStringAsync();
    }

    public async Task<HttpResponseMessage> PostAsync<T>(string uri, T item,
        string authorizationToken = null, string requestId = null,
        string authorizationMethod = "Bearer")
    {
        // Rest of the code and other Http methods ...

The interesting implementation is to code another, similar class, but using Polly to implement the resilient mechanisms you want to use—in the following example, retries with exponential backoff.

public class ResilientHttpClient : IHttpClient
{
    private HttpClient _client;
    private PolicyWrap _policyWrapper;
    private ILogger<ResilientHttpClient> _logger;

    public ResilientHttpClient(Policy[] policies,
        ILogger<ResilientHttpClient> logger)
    {
        _client = new HttpClient();
        _logger = logger;
        // Add Policies to be applied
        _policyWrapper = Policy.WrapAsync(policies);
    }

    private Task<T> HttpInvoker<T>(Func<Task<T>> action)
    {
        // Executes the action applying all
        // the policies defined in the wrapper
        return _policyWrapper.ExecuteAsync(() => action());
    }

    public Task<string> GetStringAsync(string uri,
        string authorizationToken = null,
        string authorizationMethod = "Bearer")
    {
        return HttpInvoker(async () =>
        {
            var requestMessage = new HttpRequestMessage(HttpMethod.Get, uri);
            // The Token's related code eliminated for clarity in code snippet
            var response = await _client.SendAsync(requestMessage);
            return await response.Content.ReadAsStringAsync();
        });
    }

    // Other Http methods executed through HttpInvoker so it applies Polly policies
    // ...
}

With Polly, you define a Retry policy with the number of retries, the exponential backoff configuration, and the actions to take when there is an HTTP exception, such as logging the error. In this case, the policy is configured to retry the number of times specified when registering the types in the IoC container. Because of the exponential backoff configuration, whenever the code detects an HttpRequestException, it retries the HTTP request after waiting an amount of time that increases exponentially, depending on how the policy was configured.

The important method is HttpInvoker, which is what makes HTTP requests throughout this utility class. That method internally executes the HTTP request with _policyWrapper.ExecuteAsync, which takes into account the retry policy.

In eShopOnContainers, you specify Polly policies when registering the types in the IoC container, as in the following code from the Startup.cs class of the MVC web application.

// Startup.cs class
if (Configuration.GetValue<string>("UseResilientHttp") == bool.TrueString)
{
    services.AddTransient<IResilientHttpClientFactory,
        ResilientHttpClientFactory>();
    services.AddSingleton<IHttpClient,
        ResilientHttpClient>(sp =>
            sp.GetService<IResilientHttpClientFactory>()
              .CreateResilientHttpClient());
}
else
{
    services.AddSingleton<IHttpClient, StandardHttpClient>();
}

Note that the IHttpClient objects are instantiated as singleton instead of as transient so that TCP connections are used efficiently by the service and you avoid socket exhaustion issues.

But the important point about resiliency is that you apply the Polly WaitAndRetryAsync policy within ResilientHttpClientFactory in the CreateResilientHttpClient method, as shown in the following code:

public ResilientHttpClient CreateResilientHttpClient()
    => new ResilientHttpClient(CreatePolicies(), _logger);

// Other code

private Policy[] CreatePolicies()
    => new Policy[]
    {
        Policy.Handle<HttpRequestException>()
            .WaitAndRetryAsync(
                // number of retries
                6,
                // exponential backoff
                retryAttempt => TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)),
                // on retry
                (exception, timeSpan, retryCount, context) =>
                {
                    var msg = $"Retry {retryCount} implemented with Polly's RetryPolicy " +
                              $"of {context.PolicyKey} " +
                              $"at {context.ExecutionKey}, " +
                              $"due to: {exception}.";
                    _logger.LogWarning(msg);
                    _logger.LogDebug(msg);
                }),
    };

Implementing the Circuit Breaker pattern

As noted earlier, you should handle faults that might take a variable amount of time to recover from, as might happen when you try to connect to a remote service or resource. Handling this type of fault can improve the stability and resiliency of an application.

In a distributed environment, calls to remote resources and services can fail due to transient faults, such as slow network connections and timeouts, or if resources are being slow or are temporarily unavailable. These faults typically correct themselves after a short time, and a robust cloud application should be prepared to handle them by using a strategy like the Retry pattern.

However, there can also be situations where faults are due to unanticipated events that might take much longer to fix. These faults can range in severity from a partial loss of connectivity to the complete failure of a service. In these situations, it might be pointless for an application to continually retry an operation that is unlikely to succeed. Instead, the application should be coded to accept that the operation has failed and handle the failure accordingly.

The Circuit Breaker pattern has a different purpose than the Retry pattern. The Retry pattern enables an application to retry an operation in the expectation that the operation will eventually succeed. The Circuit Breaker pattern prevents an application from performing an operation that is likely to fail. An application can combine these two patterns by using the Retry pattern to invoke an operation through a circuit breaker. However, the retry logic should be sensitive to any exceptions returned by the circuit breaker, and it should abandon retry attempts if the circuit breaker indicates that a fault is not transient.
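A minimal Polly sketch of that combination (illustrative only, not the eShopOnContainers implementation) could look like the following. Because the retry policy only handles HttpRequestException, a BrokenCircuitException thrown by the inner circuit breaker surfaces to the caller immediately instead of being retried.

using System;
using System.Net.Http;
using Polly;

public static class ResiliencePolicies
{
    public static Policy CreateRetryOverCircuitBreaker()
    {
        // Inner policy: break the circuit after 5 consecutive HTTP failures.
        var circuitBreaker = Policy
            .Handle<HttpRequestException>()
            .CircuitBreakerAsync(5, TimeSpan.FromMinutes(1));

        // Outer policy: retry transient HTTP failures with exponential backoff.
        // BrokenCircuitException is intentionally not handled here, so retries
        // stop as soon as the circuit breaker reports a non-transient fault.
        var retry = Policy
            .Handle<HttpRequestException>()
            .WaitAndRetryAsync(3,
                attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

        // The retry wraps the circuit breaker, so every attempt goes through it.
        return Policy.WrapAsync(retry, circuitBreaker);
    }
}

In real use, the circuit-breaker instance would need to be shared (for example, registered as a singleton) so that its state persists across calls.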

Implementing a Circuit Breaker pattern with Polly

As when implementing retries, the recommended approach for circuit breakers is to take advantage of proven .NET libraries like Polly.

The eShopOnContainers application uses the Polly Circuit Breaker policy when implementing HTTP retries. In fact, the application applies both policies to the ResilientHttpClient utility class. Whenever you use an object of type ResilientHttpClient for HTTP requests (from eShopOnContainers), you will be applying both those policies, but you could add additional policies, too.

The only addition here to the code used for HTTP call retries is the code where you add the Circuit Breaker policy to the list of policies to use, as shown at the end of the following code:

public ResilientHttpClient CreateResilientHttpClient()
    => new ResilientHttpClient(CreatePolicies(), _logger);

private Policy[] CreatePolicies()
    => new Policy[]
    {
        Policy.Handle<HttpRequestException>()
            .WaitAndRetryAsync(
                // number of retries
                6,
                // exponential backoff
                retryAttempt => TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)),
                // on retry
                (exception, timeSpan, retryCount, context) =>
                {
                    var msg = $"Retry {retryCount} implemented with Polly's RetryPolicy " +
                              $"of {context.PolicyKey} " +
                              $"at {context.ExecutionKey}, " +
                              $"due to: {exception}.";
                    _logger.LogWarning(msg);
                    _logger.LogDebug(msg);
                }),
        Policy.Handle<HttpRequestException>()
            .CircuitBreakerAsync(
                // number of exceptions before breaking circuit
                5,
                // time circuit opened before retry
                TimeSpan.FromMinutes(1),
                (exception, duration) =>
                {
                    // on circuit opened
                    _logger.LogTrace("Circuit breaker opened");
                },
                () =>
                {
                    // on circuit closed
                    _logger.LogTrace("Circuit breaker reset");
                })
    };

The code adds a policy to the HTTP wrapper. That policy defines a circuit breaker that opens when the code detects the specified number of consecutive exceptions (exceptions in a row), as passed in the exceptionsAllowedBeforeBreaking parameter (5 in this case). While the circuit is open, HTTP requests are not attempted; instead, an exception (BrokenCircuitException) is raised immediately.

Circuit breakers should also be used to redirect requests to a fallback infrastructure if you might have issues in a particular resource that is deployed in a different environment than the client application or service that is performing the HTTP call. That way, if there is an outage in the datacenter that impacts only your backend microservices but not your client applications, the client applications can redirect to the fallback services. Polly is planning a new policy to automate this failover policy scenario.

Of course, all those features are for cases where you are managing the failover from within the .NET code, as opposed to having it managed automatically for you by Azure, with location transparency.

Using the ResilientHttpClient utility class from eShopOnContainers

You use the ResilientHttpClient utility class in a way similar to how you use the .NET HttpClient class. In the following example from the eShopOnContainers MVC web application (the OrderingService agent class used by OrderController), the ResilientHttpClient object is injected through the httpClient parameter of the constructor. Then the object is used to perform HTTP requests.

public class OrderingService : IOrderingService
{
    private IHttpClient _apiClient;
    private readonly string _remoteServiceBaseUrl;
    private readonly IOptionsSnapshot<AppSettings> _settings;
    private readonly IHttpContextAccessor _httpContextAccesor;

    public OrderingService(IOptionsSnapshot<AppSettings> settings,
        IHttpContextAccessor httpContextAccesor,
        IHttpClient httpClient)
    {
        _remoteServiceBaseUrl = $"{settings.Value.OrderingUrl}/api/v1/orders";
        _settings = settings;
        _httpContextAccesor = httpContextAccesor;
        _apiClient = httpClient;
    }

    async public Task<List<Order>> GetMyOrders(ApplicationUser user)
    {
        var context = _httpContextAccesor.HttpContext;
        var token = await context.Authentication.GetTokenAsync("access_token");
        _apiClient.Inst.DefaultRequestHeaders.Authorization =
            new System.Net.Http.Headers.AuthenticationHeaderValue("Bearer", token);
        var ordersUrl = _remoteServiceBaseUrl;
        var dataString = await _apiClient.GetStringAsync(ordersUrl);
        var response = JsonConvert.DeserializeObject<List<Order>>(dataString);
        return response;
    }

    // Other methods ...

    async public Task CreateOrder(Order order)
    {
        var context = _httpContextAccesor.HttpContext;
        var token = await context.Authentication.GetTokenAsync("access_token");
        _apiClient.Inst.DefaultRequestHeaders.Authorization =
            new System.Net.Http.Headers.AuthenticationHeaderValue("Bearer", token);
        _apiClient.Inst.DefaultRequestHeaders.Add("x-requestid",
            order.RequestId.ToString());
        var ordersUrl = $"{_remoteServiceBaseUrl}/new";
        order.CardTypeId = 1;
        order.CardExpirationApiFormat();
        SetFakeIdToProducts(order);
        var response = await _apiClient.PostAsync(ordersUrl, order);
        response.EnsureSuccessStatusCode();
    }
}

Whenever the _apiClient member object is used, it internally uses the wrapper class with Polly policies: the Retry policy, the Circuit Breaker policy, and any other policy that you might want to apply from the Polly policies collection.

Testing retries in eShopOnContainers

Whenever you start the eShopOnContainers solution in a Docker host, it needs to start multiple containers. Some of the containers are slower to start and initialize, like the SQL Server container. This is especially true the first time you deploy the eShopOnContainers application into Docker, because it needs to set up the images and the database. The fact that some containers start slower than others can cause the rest of the services to initially throw HTTP exceptions, even if you set dependencies between containers at the docker-compose level, as explained in previous sections. Those docker-compose dependencies between containers are just at the process level. The container’s entry point process might be started, but SQL Server might not be ready for queries. The result can be a cascade of errors, and the application can get an exception when trying to consume that particular container.

You might also see this type of error on startup when the application is deploying to the cloud. In that case, orchestrators might be moving containers from one node or VM to another (that is, starting new instances) when balancing the number of containers across the cluster’s nodes.

The way eShopOnContainers solves this issue is by using the Retry pattern we illustrated earlier. It is also why, when starting the solution, you might get log traces or warnings like the following:

Retry 1 implemented with Polly’s RetryPolicy, due to: System.Net.Http.HttpRequestException: An error occurred while sending the request. —> System.Net.Http.CurlException: Couldn’t connect to server\n   at System.Net.Http.CurlHandler.ThrowIfCURLEError(CURLcode error)\n   at  […].”

Testing the circuit breaker in eShopOnContainers

There are a few ways you can open the circuit and test it with eShopOnContainers.

One option is to lower the allowed number of retries to 1 in the circuit breaker policy and redeploy the whole solution into Docker. With a single retry, there is a good chance that an HTTP request will fail during deployment, the circuit breaker will open, and you get an error.

Another option is to use custom middleware that is implemented in the ordering microservice. When this middleware is enabled, it catches all HTTP requests and returns status code 500. You can enable the middleware by making a GET request to the failing URI, like the following:

This request returns the current state of the middleware. If the middleware is enabled, the request returns status code 500. If the middleware is disabled, there is no response.

This request enables the middleware.

This request disables the middleware.

For instance, once the application is running, you can enable the middleware by making a request using the following URI in any browser. Note that the ordering microservice uses port 5102.

http://localhost:5102/failing?enable

You can then check the status using the URI http://localhost:5102/failing, as shown in Figure 10-4.

image

Figure 10-4. Simulating a failure with ASP.NET middleware

At this point, the ordering microservice responds with status code 500 whenever you invoke it.

Once the middleware is enabled, you can try making an order from the MVC web application. Because the requests fail, the circuit will open.

In the following example, you can see that the MVC web application has a catch block in the logic for placing an order.  If the code catches an open-circuit exception, it shows the user a friendly message telling them to wait.

[HttpPost]
public async Task<IActionResult> Create(Order model, string action)
{
    try
    {
        if (ModelState.IsValid)
        {
            var user = _appUserParser.Parse(HttpContext.User);
            await _orderSvc.CreateOrder(model);
            // Redirect to historic list.
            return RedirectToAction("Index");
        }
    }
    catch (BrokenCircuitException ex)
    {
        ModelState.AddModelError("Error",
            "It was not possible to create a new order, please try later on");
    }
    return View(model);
}

Here’s a summary. The Retry policy tries several times to make the HTTP request and gets HTTP errors. When the number of failures reaches the number set for the Circuit Breaker policy (in this case, 5), the circuit opens and the application throws a BrokenCircuitException. The result is a friendly message, as shown in Figure 10-5.

image

Figure 10-5. Circuit breaker returning an error to the UI

You can implement different logic for when to open the circuit. Or you can try an HTTP request against a different back-end microservice if there is a fallback datacenter or redundant back-end system.
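As a hypothetical sketch of that idea (not part of eShopOnContainers), a Polly Fallback policy could route the call to a secondary endpoint when the primary one keeps failing or its circuit is open; the class, URLs, and thresholds below are illustrative assumptions.

using System;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;

public class FailoverCatalogClient
{
    private static readonly HttpClient _client = new HttpClient();

    // Circuit breaker guarding the primary endpoint.
    private readonly CircuitBreakerPolicy _breaker = Policy
        .Handle<HttpRequestException>()
        .CircuitBreakerAsync(5, TimeSpan.FromMinutes(1));

    public Task<string> GetCatalogAsync(string primaryUrl, string fallbackUrl)
    {
        // If the primary call fails or its circuit is already open,
        // serve the request from the fallback deployment instead.
        var fallback = Policy<string>
            .Handle<BrokenCircuitException>()
            .Or<HttpRequestException>()
            .FallbackAsync(ct => _client.GetStringAsync(fallbackUrl));

        return fallback.ExecuteAsync(() =>
            _breaker.ExecuteAsync(() => _client.GetStringAsync(primaryUrl)));
    }
}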

Finally, another possibility for the CircuitBreakerPolicy is to use Isolate (which forces open and holds open the circuit) and Reset (which closes it again). These could be used to build a utility HTTP endpoint that invokes Isolate and Reset directly on the policy.  Such an HTTP endpoint could also be used, suitably secured, in production for temporarily isolating a downstream system, such as when you want to upgrade it. Or it could trip the circuit manually to protect a downstream system you suspect to be faulting.
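For example, a sketch of such a utility endpoint could look like the following (hypothetical code; the controller, routes, and the way the CircuitBreakerPolicy instance is shared through dependency injection are all assumptions):

using Microsoft.AspNetCore.Mvc;
using Polly.CircuitBreaker;

[Route("api/v1/[controller]")]
public class CircuitAdminController : Controller
{
    private readonly CircuitBreakerPolicy _circuitBreaker;

    public CircuitAdminController(CircuitBreakerPolicy circuitBreaker)
    {
        _circuitBreaker = circuitBreaker;
    }

    [HttpPost("isolate")]
    public IActionResult Isolate()
    {
        _circuitBreaker.Isolate(); // force the circuit open and hold it open
        return Ok();
    }

    [HttpPost("reset")]
    public IActionResult Reset()
    {
        _circuitBreaker.Reset();   // close the circuit again
        return Ok();
    }
}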

Adding a jitter strategy to the retry policy

A regular Retry policy can impact your system in cases of high concurrency and scalability and under high contention. To overcome peaks of similar retries coming from many clients in case of partial outages, a good workaround is to add a jitter strategy to the retry algorithm/policy. This can improve the overall performance of the end-to-end system by adding randomness to the exponential backoff. This spreads out the spikes when issues arise. When you use Polly, code to implement jitter could look like the following example:

Random jitterer = new Random();
Policy
    .Handle<HttpResponseException>() // etc
    .WaitAndRetry(5,    // exponential back-off plus some jitter
        retryAttempt => TimeSpan.FromSeconds(Math.Pow(2, retryAttempt))
            + TimeSpan.FromMilliseconds(jitterer.Next(0, 100))
    );

Additional resources

Health monitoring

Health monitoring can provide near-real-time information about the state of your containers and microservices. Health monitoring is critical to multiple aspects of operating microservices and is especially important when orchestrators perform partial application upgrades in phases, as explained later.

Microservices-based applications often use heartbeats or health checks to enable their performance monitors, schedulers, and orchestrators to keep track of the multitude of services. If services cannot send some sort of “I’m alive” signal, either on demand or on a schedule, your application might face risks when you deploy updates, or it might simply detect failures too late and not be able to stop cascading failures that can end up in major outages.

In the typical model, services send reports about their status, and that information is aggregated to provide an overall view of the state of health of your application. If you are using an orchestrator, you can provide health information to your orchestrator’s cluster, so that the cluster can act accordingly. If you invest in high-quality health reporting that is customized for your application, you can detect and fix issues for your running application much more easily.

Implementing health checks in ASP.NET Core services

When developing an ASP.NET Core microservice or web application, you can use a library named HealthChecks from the ASP.NET team. (As of May 2017, an early release is available on GitHub.)

This library is easy to use and provides features that let you validate that any specific external resource needed for your application (like a SQL Server database or a remote API) is working properly. When you use this library, you can also decide what it means for the resource to be healthy, as we explain later.

To use this library, you first need to use it in your microservices. Second, you need a front-end application that queries for the health reports. That front-end application could be a custom reporting application, or it could be an orchestrator that reacts accordingly to the health states.

Using the HealthChecks library in your back-end ASP.NET microservices

You can see how the HealthChecks library is used in the eShopOnContainers sample application. To begin, you need to define what constitutes a healthy status for each microservice. In the sample application, the microservices are healthy if the microservice API is accessible via HTTP and if its related SQL Server database is also available.

In the future, you will be able to install the HealthChecks library as a NuGet package. But as of this writing, you need to download and compile the code as part of your solution. Clone the code available at https://github.com/aspnet/HealthChecks  and copy the following folders to your solution.

src/common

src/Microsoft.AspNetCore.HealthChecks

src/Microsoft.Extensions.HealthChecks

src/Microsoft.Extensions.HealthChecks.SqlServer

You could also use additional checks like the ones for Azure (Microsoft.Extensions.HealthChecks.AzureStorage), but since this version of eShopOnContainers does not have any dependency on Azure, you do not need it. You do not need the ASP.NET health checks, because eShopOnContainers is based on ASP.NET Core.

Figure 10-6 shows the HealthChecks library in Visual Studio, ready to be used as a building block by any microservices.

image

Figure 10-6. ASP.NET Core HealthChecks library source code in a Visual Studio solution

As introduced earlier, the first thing to do in each microservice project is to add a reference to the three HealthChecks libraries. After that, you add the health check actions that you want to perform in that microservice. These actions are basically dependencies on other microservices (HttpUrlCheck) or databases (currently SqlCheck* for SQL Server databases). You add the action within the Startup class of each ASP.NET microservice or ASP.NET web application.

Each service or web application should be configured by adding all its HTTP or database dependencies in one AddHealthChecks method. For example, the MVC web application from eShopOnContainers depends on many services and therefore adds several AddUrlCheck calls to its health checks.

For instance, in the following code you can see how the catalog microservice adds a dependency on its SQL Server database.

// Startup.cs from Catalog.api microservice
//
public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        // Add framework services
        services.AddHealthChecks(checks =>
        {
            checks.AddSqlCheck("CatalogDb", Configuration["ConnectionString"]);
        });
        // Other services
    }
}

However, the MVC web application of eShopOnContainers has multiple dependencies on the rest of the microservices. Therefore, it calls one AddUrlCheck method for each microservice, as shown in the following example:

// Startup.cs from the MVC web app
public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        services.AddMvc();
        services.Configure<AppSettings>(Configuration);
        services.AddHealthChecks(checks =>
        {
            checks.AddUrlCheck(Configuration["CatalogUrl"]);
            checks.AddUrlCheck(Configuration["OrderingUrl"]);
            checks.AddUrlCheck(Configuration["BasketUrl"]);
            checks.AddUrlCheck(Configuration["IdentityUrl"]);
        });
    }
}

Thus, a microservice will not provide a “healthy” status until all its checks are healthy as well.

If the microservice does not have a dependency on a service or on SQL Server, you should just add a Healthy(“Ok”) check. The following code is from the eShopOnContainers basket.api microservice. (The basket microservice uses the Redis cache, but the library does not yet include a Redis health check provider.)

services.AddHealthChecks(checks =>
{
    checks.AddValueTaskCheck("HTTP Endpoint", () =>
        new ValueTask<IHealthCheckResult>(HealthCheckResult.Healthy("Ok")));
});
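If you did want the basket microservice to verify its Redis dependency in the meantime, you could hand-roll a check through the same extension point. The following is only a sketch and makes several assumptions: that StackExchange.Redis is referenced, that the ConnectionMultiplexer used by the basket service is available as redisConnection, and that the library exposes an Unhealthy factory analogous to Healthy.

services.AddHealthChecks(checks =>
{
    checks.AddValueTaskCheck("Redis cache", async () =>
    {
        try
        {
            // PingAsync throws (or times out) if the cache is unreachable.
            await redisConnection.GetDatabase().PingAsync();
            return HealthCheckResult.Healthy("Redis is reachable");
        }
        catch (Exception ex)
        {
            return HealthCheckResult.Unhealthy($"Redis check failed: {ex.Message}");
        }
    });
});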

For a service or web application to expose the health check endpoint, it has to call the UseHealthChecks([url_for_health_checks]) extension method. This method goes at the WebHostBuilder level, in the Main method of the Program class of your ASP.NET Core service or web application, right after UseKestrel, as shown in the code below.

namespace Microsoft.eShopOnContainers.WebMVC
{
    public class Program
    {
        public static void Main(string[] args)
        {
            var host = new WebHostBuilder()
                .UseKestrel()
                .UseHealthChecks("/hc")
                .UseContentRoot(Directory.GetCurrentDirectory())
                .UseIISIntegration()
                .UseStartup<Startup>()
                .Build();

            host.Run();
        }
    }
}

The process works like this: each microservice exposes the endpoint /hc. That endpoint is created by the HealthChecks library ASP.NET Core middleware. When that endpoint is invoked, it runs all the health checks that are configured in the AddHealthChecks method in the Startup class.

The UseHealthChecks method expects a port or a path. That port or path is the endpoint to use to check the health state of the service. For instance, the catalog microservice uses the path /hc.

Caching health check responses

Since you do not want to cause a Denial of Service (DoS) in your services, or you simply do not want to impact service performance by checking resources too frequently, you can cache the responses and configure a cache duration for each health check.

By default, the cache duration is internally set to 5 minutes, but you can change that cache duration on each health check, as in the following code:

checks.AddUrlCheck(Configuration["CatalogUrl"], 1);  // 1 min as cache duration

Querying your microservices to report about their health status

When you have configured health checks as described here, once the microservice is running in Docker, you can directly check from a browser if it is healthy. (This does require that you are publishing the container port out of the Docker host, so you can access the container through localhost or through the external Docker host IP.) Figure 10-7 shows a request in a browser and the corresponding response.

image

Figure 10-7. Checking health status of a single service from a browser

In that test, you can see that the catalog.api microservice (running on port 5101) is healthy, returning HTTP status 200 and status information in JSON. It also means that the service internally checked the health of its SQL Server database dependency, and that health check was itself reported as healthy.
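You could run the same check programmatically rather than from a browser; the following is a small hypothetical helper (the HealthProbe name and the assumption that the endpoint is mapped at /hc are illustrative):

using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class HealthProbe
{
    public static async Task<bool> IsHealthyAsync(string serviceBaseUrl)
    {
        using (var client = new HttpClient())
        {
            // A 200 response from /hc means every configured check passed.
            var response = await client.GetAsync($"{serviceBaseUrl}/hc");
            Console.WriteLine($"{serviceBaseUrl}/hc -> {(int)response.StatusCode}");
            return response.IsSuccessStatusCode;
        }
    }
}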

Using watchdogs

A watchdog is a separate service that can watch health and load across services, and report health about the microservices by querying with the HealthChecks library introduced earlier. This can help prevent errors that would not be detected based on the view of a single service. Watchdogs also are a good place to host code that can perform remediation actions for known conditions without user interaction.

The eShopOnContainers sample contains a web page that displays sample health check reports, as shown in Figure 10-8. This is the simplest watchdog you could have, since all it does is show the state of the microservices and web applications in eShopOnContainers. Usually a watchdog also takes actions when it detects unhealthy states.

image

Figure 10-8. Sample health check report in eShopOnContainers

In summary, the ASP.NET middleware of the ASP.NET Core HealthChecks library provides a single health check endpoint for each microservice. This will execute all the health checks defined within it and return an overall health state depending on all those checks.

The HealthChecks library is extensible, so you can add new health checks for other external resources. For example, we expect that in the future the library will have health checks for the Redis cache and for other databases. The library allows health reporting for multiple service or application dependencies, and you can then take actions based on those health checks.

Health checks when using orchestrators

To monitor the availability of your microservices, orchestrators like Docker Swarm, Kubernetes, and Service Fabric periodically perform health checks by sending requests to test the microservices. When an orchestrator determines that a service/container is unhealthy, it stops routing requests to that instance. It also usually creates a new instance of that container.

For instance, most orchestrators can use health checks to manage zero-downtime deployments. Only when the status of a service/container changes to healthy will the orchestrator start routing traffic to service/container instances.

Health monitoring is especially important when an orchestrator performs an application upgrade. Some orchestrators (like Azure Service Fabric) update services in phases—for example, they might update one-fifth of the cluster surface for each application upgrade. The set of nodes that is upgraded at the same time is referred to as an upgrade domain. After each upgrade domain has been upgraded and is available to users, that upgrade domain must pass health checks before the deployment moves to the next upgrade domain.

Another aspect of service health is reporting metrics from the service. This is an advanced capability of the health model of some orchestrators, like Service Fabric. Metrics are important when using an orchestrator because they are used to balance resource usage. Metrics also can be an indicator of system health. For example, you might have an application that has many microservices, and each instance reports a requests-per-second (RPS) metric. If one service is using more resources (memory, processor, etc.) than another service, the orchestrator could move service instances around in the cluster to try to maintain even resource utilization.

Note that if you are using Azure Service Fabric, it provides its own Health Monitoring model, which is more advanced than simple health checks.

Advanced monitoring: visualization, analysis, and alerts

The final part of monitoring is visualizing the event stream, reporting on service performance, and alerting when an issue is detected. You can use different solutions for this aspect of monitoring.

You can use simple custom applications showing the state of your services, like the custom page we showed when we explained ASP.NET Core HealthChecks. Or you could use more advanced tools like Azure Application Insights and Operations Management Suite to raise alerts based on the stream of events.

Finally, if you are storing all the event streams, you can use Microsoft Power BI or a third-party solution like Kibana or Splunk to visualize the data.

Additional resources