Retries with exponential backoff is a technique that retries a failed operation, with an exponentially increasing wait time between attempts, until a maximum retry count has been reached (the exponential backoff). This technique embraces the fact that cloud resources might intermittently be unavailable for more than a few seconds for any reason. For example, an orchestrator might be moving a container to another node in a cluster for load balancing. During that time, some requests might fail. Another example could be a database like Azure SQL Database, where a database can be moved to another server for load balancing, causing the database to be unavailable for a few seconds.
There are many approaches to implementing retry logic with exponential backoff.
For Azure SQL DB, Entity Framework Core already provides internal database connection resiliency and retry logic. But you need to enable the Entity Framework execution strategy for each DbContext connection if you want to have resilient EF Core connections.
For instance, the following code at the EF Core connection level enables resilient SQL connections that are retried if the connection fails.
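A minimal sketch of that configuration, based on how the eShopOnContainers catalog API sets it up (the CatalogContext type and the "ConnectionString" setting name come from that sample; adjust them to your own context):

```csharp
using System;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;

public class Startup
{
    public IConfiguration Configuration { get; }

    public Startup(IConfiguration configuration) => Configuration = configuration;

    public void ConfigureServices(IServiceCollection services)
    {
        services.AddDbContext<CatalogContext>(options =>
        {
            options.UseSqlServer(Configuration["ConnectionString"],
                sqlServerOptionsAction: sqlOptions =>
                {
                    // Enables the SqlServerRetryingExecutionStrategy:
                    // retry up to 10 times on transient failures,
                    // with delays between retries capped at 30 seconds.
                    sqlOptions.EnableRetryOnFailure(
                        maxRetryCount: 10,
                        maxRetryDelay: TimeSpan.FromSeconds(30),
                        errorNumbersToAdd: null);
                });
        });
    }
}
```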
When retries are enabled in EF Core connections, each operation you perform using EF Core becomes its own retryable operation. Each query and each call to SaveChanges will be retried as a unit if a transient failure occurs.
However, if your code initiates a transaction using BeginTransaction, you are defining your own group of operations that need to be treated as a unit; everything inside the transaction has to be rolled back if a failure occurs. You will see an exception like the following if you attempt to execute that transaction when using an EF execution strategy (retry policy) and you include several SaveChanges calls from multiple DbContexts in the transaction.
```
System.InvalidOperationException: The configured execution strategy 'SqlServerRetryingExecutionStrategy' does not support user initiated transactions. Use the execution strategy returned by 'DbContext.Database.CreateExecutionStrategy()' to execute all the operations in the transaction as a retriable unit.
```
The solution is to manually invoke the EF execution strategy with a delegate representing everything that needs to be executed. If a transient failure occurs, the execution strategy will invoke the delegate again. For example, the following code shows how it is implemented in eShopOnContainers with two DbContexts (_catalogContext and the IntegrationEventLogContext) when updating a product and then saving the ProductPriceChangedIntegrationEvent object, which needs to use a different DbContext.
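A sketch along the lines of the eShopOnContainers implementation (member names such as _catalogContext, _integrationEventLogService, catalogItem, and priceChangedEvent follow the surrounding text; the exact helper methods in the repository differ slightly):

```csharp
// Requires: using Microsoft.EntityFrameworkCore;
//           using Microsoft.EntityFrameworkCore.Storage;
var strategy = _catalogContext.Database.CreateExecutionStrategy();
await strategy.ExecuteAsync(async () =>
{
    // Everything inside this delegate is re-executed as a unit
    // if a transient failure occurs.
    using (var transaction = _catalogContext.Database.BeginTransaction())
    {
        _catalogContext.CatalogItems.Update(catalogItem);
        await _catalogContext.SaveChangesAsync();

        // Save the integration event through the second DbContext,
        // sharing the transaction opened on the first one.
        await _integrationEventLogService.SaveEventAsync(
            priceChangedEvent,
            _catalogContext.Database.CurrentTransaction.GetDbTransaction());

        transaction.Commit();
    }
});
```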
The first DbContext is _catalogContext and the second DbContext is within the _integrationEventLogService object. The Commit action is performed across multiple DbContexts using an EF execution strategy.
In order to create resilient microservices, you would need to handle possible HTTP failure scenarios. For that purpose, you could create your own implementation of retries with exponential backoff.
Important note: This section shows how you could create your own custom code to implement HTTP call retries. However, implementing it on your own is not recommended; instead, use mechanisms that are more powerful and reliable, yet simpler to use, such as HttpClientFactory with Polly, available since .NET Core 2.1. Those recommended approaches are explained in the next sections.
As an initial exploration, you could implement your own code with a utility class for exponential backoff as in RetryWithExponentialBackoff.cs, plus code like the following (which is also available at this GitHub repo).
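A simplified sketch of such a utility class (the actual RetryWithExponentialBackoff.cs in the repo is more elaborate; the default retry count and delays below are illustrative):

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public sealed class RetryWithExponentialBackoff
{
    private readonly int _maxRetries;
    private readonly int _delayMilliseconds;
    private readonly int _maxDelayMilliseconds;

    public RetryWithExponentialBackoff(int maxRetries = 5,
        int delayMilliseconds = 200, int maxDelayMilliseconds = 2000)
    {
        _maxRetries = maxRetries;
        _delayMilliseconds = delayMilliseconds;
        _maxDelayMilliseconds = maxDelayMilliseconds;
    }

    public async Task RunAsync(Func<Task> func)
    {
        for (int attempt = 0; ; attempt++)
        {
            try
            {
                await func();
                return;
            }
            catch (Exception ex) when (ex is TimeoutException || ex is HttpRequestException)
            {
                if (attempt >= _maxRetries)
                    throw;

                // Exponential backoff: 200 ms, 400 ms, 800 ms... capped at the max delay.
                int delay = Math.Min(
                    _delayMilliseconds * (int)Math.Pow(2, attempt),
                    _maxDelayMilliseconds);
                await Task.Delay(delay);
            }
        }
    }
}
```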
Using this code in a client C# application (another Web API client microservice, an ASP.NET MVC application, or even a C# Xamarin application) is straightforward, as the following example shows using the HttpClient class.
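A sketch of that usage (the catalog URL and the shared _httpClient field are placeholders):

```csharp
// Reuse a single HttpClient instance for the lifetime of the application.
private static readonly HttpClient _httpClient = new HttpClient();

public async Task<string> GetCatalogItemsAsync()
{
    string responseString = null;
    var retry = new RetryWithExponentialBackoff();

    await retry.RunAsync(async () =>
    {
        // Any TimeoutException or HttpRequestException raised here triggers a retry.
        var response = await _httpClient.GetAsync("http://localhost:5101/api/v1/catalog/items");
        response.EnsureSuccessStatusCode(); // Throws HttpRequestException on non-success
        responseString = await response.Content.ReadAsStringAsync();
    });

    return responseString;
}
```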
However, as mentioned, this code is suitable only as a proof of concept. The next sections explain how to use more sophisticated, yet simpler to use, approaches with HttpClientFactory, available since .NET Core 2.1, together with proven resiliency libraries like Polly.
HttpClientFactory is an opinionated factory, available since .NET Core 2.1, for creating HttpClient instances to be used in your applications.
The original and well-known HttpClient class can be easily used, but in some cases, it is not properly used by many developers.
As a first issue, while this class is disposable, using it within a using statement is not the best choice because even when you dispose of the HttpClient object, the underlying socket is not immediately released. This can cause a serious issue named 'socket exhaustion', as explained in this blog post by Simon Timms.
Therefore, HttpClient is intended to be instantiated once and reused throughout the life of an application. Instantiating an HttpClient class for every request will exhaust the number of sockets available under heavy loads. That issue will result in SocketException errors. Possible approaches to solve that problem are based on the creation of the HttpClient object as singleton or static, as explained in this Microsoft article on HttpClient usage.
But there’s a second issue with HttpClient that you can have when you use it as singleton or static object. In this case, a singleton or static HttpClient doesn't respect DNS changes, as explained in this issue at the .NET Core GitHub repo.
To address those issues and make the management of HttpClient instances easier, .NET Core 2.1 offers a new HttpClient factory that can also be used to implement resilient HTTP calls by integrating Polly with it.
HttpClientFactory is designed to:

- Provide a central location for naming and configuring logical HttpClient objects. For example, you can configure a client (Service Agent) that is pre-configured to access a specific microservice.
- Codify the concept of outgoing middleware via delegating handlers in HttpClient, and implement Polly-based middleware to take advantage of Polly's resiliency policies.
- Manage the lifetime of HttpMessageHandler objects to avoid the issues that can occur when managing HttpClient lifetimes yourself.

There are several ways that you can use HttpClient factory in your application:

- Use HttpClientFactory directly
- Use Named Clients
- Use Typed Clients
- Use Generated Clients
For the sake of brevity, this guidance shows the most structured way to use HttpClientFactory, which is to use Typed Clients (the Service Agent pattern), but all options are documented and currently listed in this article covering HttpClientFactory usage.
The following diagram shows how Typed Clients are used with HttpClientFactory.
Figure 8-4. Using HttpClientFactory with Typed Client classes.
The first thing you need to do is to set up HttpClientFactory in your application. Once you have access to the package Microsoft.Extensions.Http, the extension method AddHttpClient() for IServiceCollection is available. This extension method registers the DefaultHttpClientFactory to be used as a singleton for the interface IHttpClientFactory. It defines a transient configuration for the HttpMessageHandlerBuilder. This message handler (HttpMessageHandler object), taken from a pool, is used by the HttpClient returned from the factory.
In the next code, you can see how AddHttpClient() can be used to register Typed Clients (Service Agents) that need to use HttpClient.
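For example, along the lines of the eShopOnContainers MVC web app (the interface/implementation pairs are the Typed Clients from that sample):

```csharp
// Startup.cs - ConfigureServices
// Add HttpClient services, one registration per Typed Client (Service Agent)
services.AddHttpClient<ICatalogService, CatalogService>();
services.AddHttpClient<IBasketService, BasketService>();
services.AddHttpClient<IOrderingService, OrderingService>();
```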
Just by registering your Typed Client classes with AddHttpClient(), whenever you use the HttpClient object that is injected into your class through its constructor, that HttpClient object will use all the configuration and policies provided. In the next sections, you will see those policies, such as Polly's retries or circuit breakers.
Each time you get an HttpClient object from IHttpClientFactory, a new instance of an HttpClient is returned. There will be an HttpMessageHandler per named or typed client. IHttpClientFactory will pool the HttpMessageHandler instances created by the factory to reduce resource consumption. An HttpMessageHandler instance may be reused from the pool when creating a new HttpClient instance if its lifetime hasn't expired.
Pooling of handlers is desirable as each handler typically manages its own underlying HTTP connections; creating more handlers than necessary can result in connection delays. Some handlers also keep connections open indefinitely, which can prevent the handler from reacting to DNS changes.
The HttpMessageHandler objects in the pool have a lifetime, which is the length of time that an HttpMessageHandler instance in the pool can be reused. The default value is two minutes, but it can be overridden on a per-named or per-typed client basis. To override it, call SetHandlerLifetime() on the IHttpClientBuilder that is returned when creating the client, as shown in the following code.
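For example (the five-minute lifetime is just an illustrative value):

```csharp
// Startup.cs - ConfigureServices
services.AddHttpClient<ICatalogService, CatalogService>()
    .SetHandlerLifetime(TimeSpan.FromMinutes(5)); // Default lifetime is 2 minutes
```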
Each typed client or named client can have its own configured handler lifetime value. Set the lifetime to InfiniteTimeSpan to disable handler expiry.
As a preliminary step, you need to have your Typed Client classes defined, such as the classes in the sample code like BasketService, CatalogService, and OrderingService. A typed client is a class that accepts an HttpClient object (injected through its constructor) and uses it to call some remote HTTP service. For example:
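The following is a sketch simplified from the eShopOnContainers CatalogService (the Catalog model and the route are illustrative; the sample uses Newtonsoft.Json for deserialization):

```csharp
using System.Net.Http;
using System.Threading.Tasks;
using Newtonsoft.Json;

public class CatalogService : ICatalogService
{
    private readonly HttpClient _httpClient;

    // The HttpClient instance is created and configured by HttpClientFactory
    // and injected through the constructor.
    public CatalogService(HttpClient httpClient)
    {
        _httpClient = httpClient;
    }

    public async Task<Catalog> GetCatalogItems(int page, int take)
    {
        // The base address would be configured when registering the client.
        var uri = $"api/v1/catalog/items?pageIndex={page}&pageSize={take}";
        var responseString = await _httpClient.GetStringAsync(uri);
        return JsonConvert.DeserializeObject<Catalog>(responseString);
    }
}
```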
The typed client (CatalogService in the example) is activated by DI (Dependency Injection), meaning that it can accept any registered service in its constructor, in addition to HttpClient.
A typed client is, effectively, a transient service, meaning that a new instance is created each time one is needed and it will receive a new HttpClient instance each time it is constructed. However, the HttpMessageHandler objects in the pool are the objects that are reused by multiple Http requests.
Finally, once you have your typed client classes implemented and registered with AddHttpClient(), you can use them anywhere services can be injected by DI, for example in any Razor page code or any controller of an MVC web app, like in the following code from eShopOnContainers.
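A sketch in the spirit of the eShopOnContainers MVC CatalogController (parameters simplified):

```csharp
public class CatalogController : Controller
{
    private readonly ICatalogService _catalogSvc;

    // The typed client is resolved by DI like any other registered service.
    public CatalogController(ICatalogService catalogSvc)
    {
        _catalogSvc = catalogSvc;
    }

    public async Task<IActionResult> Index(int page = 0, int take = 10)
    {
        var catalog = await _catalogSvc.GetCatalogItems(page, take);
        return View(catalog);
    }
}
```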
Until this point, the code shown is just performing regular HTTP requests. The 'magic' comes in the following sections where, just by adding policies and delegating handlers to your registered typed clients, all the HTTP requests made by HttpClient will take resilience policies into account, such as retries with exponential backoff and circuit breakers, plus any other custom delegating handler used to implement additional security features, like using auth tokens, or any other custom feature.
The recommended approach for retries with exponential backoff is to take advantage of more advanced .NET libraries like the open-source Polly library.
Polly is a .NET library that provides resilience and transient-fault handling capabilities. You can implement those capabilities by applying Polly policies such as Retry, Circuit Breaker, Bulkhead Isolation, Timeout, and Fallback. Polly targets .NET 4.x and the .NET Standard Library 1.0 (which supports .NET Core).
However, using Polly with your own custom HttpClient code can be significantly complex. In the original version of eShopOnContainers, there was a ResilientHttpClient building block based on Polly. But with the release of HttpClientFactory, resilient HTTP communication has become much simpler to implement, so that building block was deprecated from eShopOnContainers.
The following steps show how you can use Http retries with Polly integrated into HttpClientFactory, which is explained in the previous section.
Reference the ASP.NET Core 2.1 packages
Your project has to be using the ASP.NET Core 2.1 packages from NuGet. You typically need the AspNetCore metapackage and the extension package Microsoft.Extensions.Http.Polly.
Configure a client with Polly’s Retry policy, in Startup
As shown in previous sections, you need to define a named or typed client HttpClient configuration in your standard Startup.ConfigureServices(...) method, but now you add incremental code specifying the policy for the HTTP retries with exponential backoff, as below:
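A sketch of that configuration (the typed client and the five-minute handler lifetime follow the earlier examples; GetRetryPolicy() is defined next):

```csharp
// Startup.cs - ConfigureServices
services.AddHttpClient<IBasketService, BasketService>()
    .SetHandlerLifetime(TimeSpan.FromMinutes(5))
    .AddPolicyHandler(GetRetryPolicy()); // The incremental line adding the retry policy
```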
The AddPolicyHandler() method is what adds policies to the HttpClient objects you will use. In this case, it adds a Polly policy for HTTP retries with exponential backoff.
In order to have a more modular approach, the HTTP Retry Policy can be defined in a separate method within the Startup.cs file, as shown in the following code.
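A sketch of that method, assuming the Microsoft.Extensions.Http.Polly package (which brings in the HttpPolicyExtensions helpers):

```csharp
using System;
using System.Net.Http;
using Polly;
using Polly.Extensions.Http;

static IAsyncPolicy<HttpResponseMessage> GetRetryPolicy()
{
    return HttpPolicyExtensions
        .HandleTransientHttpError() // HttpRequestException, 5xx and 408 responses
        .WaitAndRetryAsync(6, retryAttempt =>
            TimeSpan.FromSeconds(Math.Pow(2, retryAttempt))); // 2s, 4s, 8s, 16s, 32s, 64s
}
```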
With Polly, you can define a Retry policy with the number of retries, the exponential backoff configuration, and the actions to take when there is an HTTP exception, such as logging the error. In this case, the policy is configured to try six times with an exponential retry, starting at two seconds.
As noted earlier, you should handle faults that might take a variable amount of time to recover from, as it might happen when you try to connect to a remote service or resource. Handling this type of fault can improve the stability and resiliency of an application.
In a distributed environment, calls to remote resources and services can fail due to transient faults, such as slow network connections and timeouts, or if resources are being slow or are temporarily unavailable. These faults typically correct themselves after a short time, and a robust cloud application should be prepared to handle them by using a strategy like the Retry pattern.
However, there can also be situations where faults are due to unanticipated events that might take much longer to fix. These faults can range in severity from a partial loss of connectivity to the complete failure of a service. In these situations, it might be pointless for an application to continually retry an operation that is unlikely to succeed. Instead, the application should be coded to accept that the operation has failed and handle the failure accordingly.
Using Http retries carelessly could result in creating a Denial of Service (DoS) attack within your own software. As a microservice fails or performs slowly, multiple clients might repeatedly retry failed requests. That creates a dangerous risk of exponentially increasing traffic targeted at the failing service.
Therefore, you need some kind of defense barrier so that requests stop when it is not worth continuing to try. That defense barrier is precisely the circuit breaker.
The Circuit Breaker pattern has a different purpose than the Retry pattern. The Retry pattern enables an application to retry an operation in the expectation that the operation will eventually succeed. The Circuit Breaker pattern prevents an application from performing an operation that is likely to fail. An application can combine these two patterns. However, the retry logic should be sensitive to any exception returned by the circuit breaker, and it should abandon retry attempts if the circuit breaker indicates that a fault is not transient.
As when implementing retries, the recommended approach for circuit breakers is to take advantage of proven .NET libraries like Polly and its native integration with HttpClientFactory.
Adding a circuit breaker policy into your HttpClientFactory outgoing middleware pipeline is as simple as adding a single incremental piece of code to what you already have when using HttpClientFactory.
The only addition here to the code used for HTTP call retries is the code where you add the Circuit Breaker policy to the list of policies to use, as shown in the following incremental code, part of the ConfigureServices() method.
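A sketch of that incremental code (GetCircuitBreakerPolicy() is shown in the next snippet):

```csharp
// Startup.cs - ConfigureServices
services.AddHttpClient<IBasketService, BasketService>()
    .SetHandlerLifetime(TimeSpan.FromMinutes(5))
    .AddPolicyHandler(GetRetryPolicy())
    .AddPolicyHandler(GetCircuitBreakerPolicy()); // The incremental line
```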
The AddPolicyHandler() method is what adds policies to the HttpClient objects you will use. In this case, it adds a Polly policy for a circuit breaker.
In order to have a more modular approach, the Circuit Breaker Policy is defined in a separate method named GetCircuitBreakerPolicy(), as shown in the following code.
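A sketch of that method, using the same HttpPolicyExtensions helpers as the retry policy:

```csharp
static IAsyncPolicy<HttpResponseMessage> GetCircuitBreakerPolicy()
{
    return HttpPolicyExtensions
        .HandleTransientHttpError()
        // Open the circuit after 5 consecutive faults,
        // and keep it open for 30 seconds.
        .CircuitBreakerAsync(5, TimeSpan.FromSeconds(30));
}
```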
In the code example above, the circuit breaker policy is configured to break or open the circuit when there have been five consecutive exceptions while retrying the HTTP requests. Then, 30 seconds will be the duration of the break.
Circuit breakers should also be used to redirect requests to a fallback infrastructure if you had issues in a particular resource that is deployed in a different environment than the client application or service that is performing the HTTP call. That way, if there is an outage in the datacenter that impacts only your backend microservices but not your client applications, the client applications can redirect to the fallback services. Polly is planning a new policy to automate this failover policy scenario.
Of course, all those features are for cases where you are managing the failover from within the .NET code, as opposed to having it managed automatically for you by Azure, with location transparency.
From a usage point of view, when using HttpClient, there's no need to add anything new here because the code is the same as when using HttpClient with HttpClientFactory, as shown in previous sections.
Whenever you start the eShopOnContainers solution in a Docker host, it needs to start multiple containers. Some of the containers are slower to start and initialize, like the SQL Server container. This is especially true the first time you deploy the eShopOnContainers application into Docker, because it needs to set up the images and the database. The fact that some containers start slower than others can cause the rest of the services to initially throw HTTP exceptions, even if you set dependencies between containers at the docker-compose level, as explained in previous sections. Those docker-compose dependencies between containers are just at the process level. The container’s entry point process might be started, but SQL Server might not be ready for queries. The result can be a cascade of errors, and the application can get an exception when trying to consume that particular container.
You might also see this type of error on startup when the application is being deployed to the cloud. In that case, orchestrators might be moving containers from one node or VM to another (that is, starting new instances) when balancing the number of containers across the cluster's nodes.
The way eShopOnContainers solves those issues when starting all the containers is by using the Retry pattern illustrated earlier.
There are a few ways you can break/open the circuit and test it with eShopOnContainers.
One option is to lower the allowed number of retries to 1 in the circuit breaker policy and redeploy the whole solution into Docker. With a single retry, there is a good chance that an HTTP request will fail during deployment, the circuit breaker will open, and you get an error.
Another option is to use custom middleware that is implemented in the Basket microservice. When this middleware is enabled, it catches all HTTP requests and returns status code 500. You can enable the middleware by making a GET request to the failing URI, like the following:
GET http://localhost:5103/failing
This request returns the current state of the middleware. If the middleware is enabled, the request returns status code 500. If the middleware is disabled, there is no response.

GET http://localhost:5103/failing?enable
This request enables the middleware.

GET http://localhost:5103/failing?disable
This request disables the middleware.
For instance, once the application is running, you can enable the middleware by making a request using the following URI in any browser. Note that the Basket microservice uses port 5103.
http://localhost:5103/failing?enable
You can then check the status using the URI http://localhost:5103/failing, as shown in Figure 8-5.
Figure 8-5. Checking the state of the “Failing” ASP.NET middleware – In this case, disabled.
At this point, the Basket microservice responds with status code 500 whenever you invoke it.
Once the middleware is running, you can try making an order from the MVC web application. Because the requests fail, the circuit will open.
In the following example, you can see that the MVC web application has a catch block in the logic for placing an order. If the code catches an open-circuit exception, it shows the user a friendly message telling them to wait.
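A sketch along the lines of the eShopOnContainers CartController (helper names such as _basketSvc and GetUserId() are illustrative):

```csharp
using Polly.CircuitBreaker;

public class CartController : Controller
{
    // ...
    public async Task<IActionResult> Index()
    {
        try
        {
            // Regular call through the resilient typed client.
            var basket = await _basketSvc.GetBasket(GetUserId());
            return View(basket);
        }
        catch (BrokenCircuitException)
        {
            // The circuit is open: show the user a friendly message
            // instead of surfacing the failure.
            TempData["BasketInoperativeMsg"] =
                "Basket Service is inoperative, please try later on.";
        }
        return View();
    }
}
```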
Here’s a summary. The Retry policy tries several times to make the HTTP request and gets HTTP errors. When the number of retries reaches the maximum number set for the Circuit Breaker policy (in this case, 5), the application throws a BrokenCircuitException. The result is a friendly message, as shown in Figure 8-6.
Figure 8-6. Circuit breaker returning an error to the UI
You can implement different logic for when to open/break the circuit. Or you can try an HTTP request against a different back-end microservice if there is a fallback datacenter or redundant back-end system.
Finally, another possibility for the CircuitBreakerPolicy is to use Isolate (which forces open and holds open the circuit) and Reset (which closes it again). These could be used to build a utility HTTP endpoint that invokes Isolate and Reset directly on the policy. Such an HTTP endpoint could also be used, suitably secured, in production for temporarily isolating a downstream system, such as when you want to upgrade it. Or it could trip the circuit manually to protect a downstream system you suspect to be faulting.
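A hypothetical sketch of such a utility endpoint (not part of eShopOnContainers; it assumes a CircuitBreakerPolicy instance registered as a singleton so the controller reaches the same policy instance used by the HTTP calls):

```csharp
using Microsoft.AspNetCore.Mvc;
using Polly.CircuitBreaker;

[Route("api/circuit")]
public class CircuitController : Controller
{
    private readonly CircuitBreakerPolicy _breaker;

    public CircuitController(CircuitBreakerPolicy breaker)
    {
        _breaker = breaker;
    }

    [HttpPut("isolate")]
    public IActionResult Isolate()
    {
        _breaker.Isolate(); // Forces the circuit open and holds it open
        return Ok();
    }

    [HttpPut("reset")]
    public IActionResult Reset()
    {
        _breaker.Reset();   // Closes the circuit again
        return Ok();
    }
}
```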
A regular Retry policy can impact your system in cases of high concurrency and scalability and under high contention. To overcome peaks of similar retries coming from many clients in case of partial outages, a good workaround is to add a jitter strategy to the retry algorithm/policy. This can improve the overall performance of the end-to-end system by adding randomness to the exponential backoff. This spreads out the spikes when issues arise. When you use a plain Polly policy, code to implement jitter could look like the following example:
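A sketch of a plain Polly policy with jitter (the 0 to 100 ms jitter range is illustrative):

```csharp
var jitterer = new Random();
var retryWithJitterPolicy = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(6, retryAttempt =>
        TimeSpan.FromSeconds(Math.Pow(2, retryAttempt))        // exponential backoff
        + TimeSpan.FromMilliseconds(jitterer.Next(0, 100)));   // plus some jitter
```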
https://docs.microsoft.com/en-us/aspnet/core/fundamentals/http-requests?view=aspnetcore-2.1
https://github.com/aspnet/HttpClientFactory
https://github.com/App-vNext/Polly/wiki/Polly-and-HttpClientFactory
Health monitoring can provide near-real-time information about the state of your containers and microservices. Health monitoring is critical to multiple aspects of operating microservices, and is especially important when orchestrators perform partial application upgrades in phases, as explained later.
Microservices-based applications often use heartbeats or health checks to enable their performance monitors, schedulers, and orchestrators to keep track of the multitude of services. If services cannot send some sort of “I’m alive” signal, either on demand or on a schedule, your application might face risks when you deploy updates, or it might simply detect failures too late and not be able to stop cascading failures that can end up in major outages.
In the typical model, services send reports about their status, and that information is aggregated to provide an overall view of the state of health of your application. If you are using an orchestrator, you can provide health information to your orchestrator’s cluster, so that the cluster can act accordingly. If you invest in high-quality health reporting that is customized for your application, you can detect and fix issues for your running application much more easily.
When developing an ASP.NET Core microservice or web application, you can use a library named HealthChecks from the ASP.NET team (an early release is available at this GitHub repo).
This library is easy to use and provides features that let you validate that any specific external resource needed for your application (like a SQL Server database or a remote API) is working properly. When you use this library, you can also decide what it means for the resource to be healthy, as we explain later.
In order to use this library, you first need to include and use it in your microservices. Second, you need a front-end application that queries for the health reports. That front-end application could be a custom reporting application, or it could be an orchestrator itself that reacts accordingly to the health states.
You can see how the HealthChecks library is used in the eShopOnContainers sample application. To begin, you need to define what constitutes a healthy status for each microservice. In the sample application, the microservices are healthy if the microservice API is accessible via HTTP and if its related SQL Server database is also available.
In the future, you will be able to install the HealthChecks library as a NuGet package. But as of this writing, you need to download and compile the code as part of your solution. Clone the code available at https://github.com/dotnet-architecture/HealthChecks and copy the following folders to your solution:
src/common
src/Microsoft.AspNetCore.HealthChecks
src/Microsoft.Extensions.HealthChecks
src/Microsoft.Extensions.HealthChecks.SqlServer
You could also use additional checks like the ones for Azure (Microsoft.Extensions.HealthChecks.AzureStorage), but since this version of eShopOnContainers does not have any dependency on Azure, you do not need it. You do not need the ASP.NET health checks, because eShopOnContainers is based on ASP.NET Core.
Figure 8-7 shows the HealthChecks library in Visual Studio, ready to be used as a building block by any microservices.
Figure 8-7. ASP.NET Core HealthChecks library source code in a Visual Studio solution
As introduced earlier, the first thing to do in each microservice project is to add a reference to the three HealthChecks libraries. After that, you add the health check actions that you want to perform in that microservice. These actions are basically dependencies on other microservices (HttpUrlCheck) or databases (currently SqlCheck* for SQL Server databases). You add the action within the Startup class of each ASP.NET microservice or ASP.NET web application.
Each service or web application should be configured by adding all its HTTP or database dependencies as one AddHealthCheck method. For example, the MVC web application from eShopOnContainers depends on many services and therefore has several AddCheck methods added to its health checks.
For instance, in the following code you can see how the catalog microservice adds a dependency on its SQL Server database.
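For example, along the lines of the Catalog.api Startup (the "HealthCheck:Timeout" and "ConnectionString" setting names come from that sample):

```csharp
// Startup.cs from the Catalog.api microservice
public void ConfigureServices(IServiceCollection services)
{
    // Add framework services
    services.AddHealthChecks(checks =>
    {
        var minutes = 1;
        if (int.TryParse(Configuration["HealthCheck:Timeout"], out var minutesParsed))
        {
            minutes = minutesParsed;
        }
        // Checks that the SQL Server database answers within the cache duration
        checks.AddSqlCheck("CatalogDb", Configuration["ConnectionString"],
            TimeSpan.FromMinutes(minutes));
    });
    // Other services...
}
```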
However, the MVC web application of eShopOnContainers has multiple dependencies on the rest of the microservices. Therefore, it calls one AddUrlCheck method for each microservice, as shown in the following example:
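For example, as in the MVC web application's Startup (the URL setting names follow the eShopOnContainers configuration):

```csharp
// Startup.cs from the MVC web application
public void ConfigureServices(IServiceCollection services)
{
    services.AddHealthChecks(checks =>
    {
        var minutes = 1;
        if (int.TryParse(Configuration["HealthCheck:Timeout"], out var minutesParsed))
        {
            minutes = minutesParsed;
        }
        // One URL check per microservice dependency
        checks.AddUrlCheck(Configuration["CatalogUrl"], TimeSpan.FromMinutes(minutes));
        checks.AddUrlCheck(Configuration["OrderingUrl"], TimeSpan.FromMinutes(minutes));
        checks.AddUrlCheck(Configuration["BasketUrl"], TimeSpan.FromMinutes(minutes));
        checks.AddUrlCheck(Configuration["IdentityUrl"], TimeSpan.FromMinutes(minutes));
    });
    // Other services...
}
```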
Thus, a microservice will not provide a “healthy” status until all its checks are healthy as well.
If the microservice does not have a dependency on a service or on SQL Server, you should just add a Healthy("Ok") check. The following code is from the eShopOnContainers basket.api microservice. (The basket microservice uses the Redis cache, but the library does not yet include a Redis health check provider.)
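For example, a minimal self-reporting check, as in the Basket.api Startup:

```csharp
// Startup.cs from the Basket.api microservice
services.AddHealthChecks(checks =>
{
    // No external dependency checked by the library yet,
    // so report healthy as long as the HTTP endpoint responds.
    checks.AddValueTaskCheck("HTTP Endpoint", () =>
        new ValueTask<IHealthCheckResult>(HealthCheckResult.Healthy("Ok")));
});
```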
For a service or web application to expose the health check endpoint, it has to enable the UseHealthChecks([url_for_health_checks]) extension method. This method goes at the WebHostBuilder level in the Main method of the Program class of your ASP.NET Core service or web application, right after UseKestrel, as shown in the code below.
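For example, as in the eShopOnContainers MVC web host (the "/hc" path is the convention used by that sample):

```csharp
// Program.cs from an ASP.NET Core microservice or web application
public class Program
{
    public static void Main(string[] args)
    {
        var host = new WebHostBuilder()
            .UseKestrel()
            .UseHealthChecks("/hc") // Exposes the health check endpoint at /hc
            .UseContentRoot(Directory.GetCurrentDirectory())
            .UseIISIntegration()
            .UseStartup<Startup>()
            .Build();

        host.Run();
    }
}
```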