Bulletproof Enterprise Java for Meeting Challenging Production Requirements

How to make your enterprise Java applications resilient for the hard production life by using timeouts, retries, circuit breakers, bulkheads, and backpressure.

by Sebastian Daschner

It’s one thing to develop enterprise Java applications and to deploy and validate them in test environments. Running applications in production, however, is a whole other story. Production life is harsh and unforgiving, and it is the ultimate test of whether your application is able to deliver value to its users. Besides knowing the enterprise Java APIs, developers also need to know how to meet production requirements.

Which non-functional requirements do we enterprise developers need to be aware of to build stable and resilient applications? How can we implement different resiliency approaches, such as circuit breakers, bulkheads, or backpressure using the Java EE API, MicroProfile, or certain Java EE extensions? And furthermore, how do enterprise Java resiliency approaches play along with new cloud native technologies such as Kubernetes and Istio?

Requirements

First of all, we need to clarify what it means for an application to be bulletproof.

The application should guarantee to execute its functionality in a stable, responsive fashion, without minor circumstances causing it to crash irrecoverably. It should therefore be able to tolerate and recover from small errors during execution, especially without affecting other, unrelated functionality. The overall health of the system also needs to be taken into account, which means that applications must not blindly overload other applications that might currently be under load.

Furthermore, there is the saying that you should be conservative in what you do and liberal in what you accept. Accordingly, enterprise applications should not be overly strict in rejecting messages that are technically comprehensible but do not quite follow the specifications.

Enterprise Java

We will use an example enterprise system that comprises an instrument craft shop, which is an application for musical instruments, and a “maker bot” back end, which 3D-prints the instruments. Clients use the HTTP API of the instrument craft shop to order new instruments. The instrument craft shop calls the maker bot synchronously via HTTP to order the production of new instruments.

We will first see which enterprise Java functionality needs to be considered in this example in regard to resiliency.

Timeouts

In order to avoid deadlock situations, it’s crucial to build in timeouts for synchronous communication. Timeouts are a trade-off between liveness, meaning that the application remains able to handle incoming requests, and progress, because a timeout might abort the processing of a request that would otherwise have finished successfully shortly afterward.

While timeouts are crucial for all kinds of synchronous communication, we’ll specifically focus on HTTP invocations.

Our instrument craft shop comprises a gateway component to access the maker bot HTTP endpoints via the JAX-RS client:


@ApplicationScoped
public class MakerBot {

    private Client client;
    private WebTarget target;

    @PostConstruct
    private void initClient() {
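        client = ClientBuilder.newBuilder()
                // give up if no connection can be established within 2 seconds
                .connectTimeout(2, TimeUnit.SECONDS)
                // give up if no response data arrives within 4 seconds
                .readTimeout(4, TimeUnit.SECONDS)
                .build();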
        target = client
                .target("http://maker-bot:9080/maker-bot/resources/jobs");
    }

    public void printInstrument(InstrumentType type) {
        JsonObject requestBody = createRequestBody(type);
        Response response = sendRequest(requestBody);
        validateResponse(response);
    }

    // ...
}
 

Since JAX-RS version 2.1, the ClientBuilder supports a standardized timeout configuration via the connectTimeout() and readTimeout() methods. Depending on the HTTP implementation being used, not specifying timeout values might result in invocations blocking indefinitely.

The actual timeout values, of course, depend on the actual application and the environment setup.
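The same trade-off applies to any blocking call, not only to HTTP. The following is a minimal sketch in plain Java SE that bounds the waiting time for an arbitrary call. The class TimeLimiter and the method callWithTimeout are hypothetical names chosen for this illustration and are not part of JAX-RS or Java EE. The sketch also creates its own thread pool, which is fine for a standalone example but is not how threads should be obtained inside an application container.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration; not part of JAX-RS or Java EE.
class TimeLimiter {

    // Waits at most 'timeout' for the call to finish.
    // An empty Optional signals a timeout (or a call that returned null).
    static <T> Optional<T> callWithTimeout(ExecutorService executor, Callable<T> call, Duration timeout) {
        Future<T> future = executor.submit(call);
        try {
            return Optional.ofNullable(future.get(timeout.toMillis(), TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // favor liveness over progress: stop waiting and interrupt the slow call
            future.cancel(true);
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return Optional.empty();
        } catch (ExecutionException e) {
            throw new IllegalStateException("call failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newCachedThreadPool();
        try {
            Optional<String> fast = callWithTimeout(executor, () -> "fast", Duration.ofSeconds(2));
            Optional<String> slow = callWithTimeout(executor, () -> {
                Thread.sleep(5_000);
                return "slow";
            }, Duration.ofMillis(100));
            System.out.println(fast.isPresent()); // true: finished within the limit
            System.out.println(slow.isPresent()); // false: gave up after 100 milliseconds
        } finally {
            executor.shutdownNow();
        }
    }
}
```

The second call would have succeeded after five seconds. The timeout deliberately sacrifices that progress so the caller stays responsive.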

Circuit Breakers

Similar to circuit breakers in electrical engineering, circuit breakers in software detect failures, or slow responses, and prevent further damage by inhibiting actions that are doomed to fail. We can specify the circumstances when a circuit breaker should interrupt the execution of some functionality based on the previous executions.

There are multiple third-party libraries available that implement circuit breakers, including the MicroProfile Fault Tolerance project, which integrates very well with Java EE and is supported by a few application container vendors. The following declares the printInstrument method of the MakerBot class as being guarded by a MicroProfile circuit breaker with default behavior:


import org.eclipse.microprofile.faulttolerance.CircuitBreaker;
// ...

@CircuitBreaker
public void printInstrument(InstrumentType type) {
    JsonObject requestBody = createRequestBody(type);
    Response response = sendRequest(requestBody);
    validateResponse(response);
}

By default, the @CircuitBreaker annotation will cause the circuit to open, and further executions of the method to be interrupted, if the method fails (that is, if it throws an exception) more than 50 percent of the time within 20 invocations. After the circuit is opened, the execution will be interrupted for at least 5 seconds. These default values can be overridden using the annotation’s attributes.

It's possible to define fallback behavior using the @Fallback annotation, which refers to either a fallback handler class or a fallback method.
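To make the mechanics more tangible, here is a heavily simplified circuit breaker in plain Java. It is a sketch under stated assumptions and not the MicroProfile implementation: the class SimpleCircuitBreaker, its constructor parameters, and the injectable clock are hypothetical, there is no half-open state, and calls are serialized by a lock for brevity. The circuit opens once the failure ratio within a full rolling window exceeds the configured ratio, and the fallback is used while the circuit is open.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical, simplified circuit breaker; not the MicroProfile implementation.
class SimpleCircuitBreaker {

    private final int windowSize;       // number of recent invocations that are evaluated
    private final double failureRatio;  // failure ratio above which the circuit opens
    private final long delayMillis;     // how long the circuit stays open
    private final LongSupplier clock;   // current time in milliseconds (injectable for tests)

    private final Deque<Boolean> window = new ArrayDeque<>(); // true = failed invocation
    private long openedAt = -1;                               // -1 means the circuit is closed

    SimpleCircuitBreaker(int windowSize, double failureRatio, long delayMillis, LongSupplier clock) {
        this.windowSize = windowSize;
        this.failureRatio = failureRatio;
        this.delayMillis = delayMillis;
        this.clock = clock;
    }

    synchronized <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (openedAt >= 0) {
            if (clock.getAsLong() - openedAt < delayMillis) {
                return fallback.get(); // open: do not even try the doomed action
            }
            // delay has elapsed: close the circuit and start with a fresh window
            openedAt = -1;
            window.clear();
        }
        try {
            T result = action.get();
            record(false);
            return result;
        } catch (RuntimeException e) {
            record(true);
            return fallback.get();
        }
    }

    synchronized boolean isOpen() {
        return openedAt >= 0 && clock.getAsLong() - openedAt < delayMillis;
    }

    private void record(boolean failed) {
        window.addLast(failed);
        if (window.size() > windowSize) {
            window.removeFirst();
        }
        if (window.size() == windowSize) {
            long failures = window.stream().filter(f -> f).count();
            if ((double) failures / windowSize > failureRatio) {
                openedAt = clock.getAsLong();
            }
        }
    }

    public static void main(String[] args) {
        SimpleCircuitBreaker breaker =
                new SimpleCircuitBreaker(20, 0.5, 5_000, System::currentTimeMillis);
        for (int i = 0; i < 25; i++) {
            breaker.call(() -> { throw new IllegalStateException("maker bot down"); }, () -> "fallback");
        }
        System.out.println("open: " + breaker.isOpen());
    }
}
```

In the main method, the first 20 failing calls fill the window and open the circuit. The remaining calls are answered by the fallback without invoking the action at all.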

Retries

The motivation behind retries is to iron out temporary failures by immediately retrying a failed action. This retry happens transparently to the calling functionality.

It’s simple to implement technically motivated retries using MicroProfile Fault Tolerance. The @Retry annotation will cause method invocations to be re-executed up to three times if an exception occurs. We can further configure the behavior, such as with delay times or exception types, using the annotation values.

Similar to circuit breakers, it’s also possible to define a @Fallback behavior if the invocation still fails after the maximum number of retries.
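The behavior described here can be mimicked in a few lines of plain Java. The following sketch is not how MicroProfile Fault Tolerance is implemented. SimpleRetry and callWithRetry are hypothetical names, only RuntimeExceptions are retried, and the delay is assumed to be zero or positive.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical helper for illustration; not the MicroProfile implementation.
class SimpleRetry {

    // Invokes the action once and re-executes it up to maxRetries times if it throws.
    // If every attempt fails, the fallback provides the result instead.
    static <T> T callWithRetry(Supplier<T> action, int maxRetries, long delayMillis, Supplier<T> fallback) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                // stop after the last attempt, or if the thread was interrupted while waiting
                if (attempt == maxRetries || !pause(delayMillis)) {
                    break;
                }
            }
        }
        return fallback.get();
    }

    private static boolean pause(long millis) {
        try {
            Thread.sleep(millis);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        AtomicInteger attempts = new AtomicInteger();
        String result = callWithRetry(() -> {
            if (attempts.incrementAndGet() < 3) {
                throw new IllegalStateException("temporary failure");
            }
            return "printed";
        }, 3, 0, () -> "fallback");
        // prints "printed after 3 attempts"
        System.out.println(result + " after " + attempts.get() + " attempts");
    }
}
```

The caller only sees the final outcome. The two failed attempts stay invisible, which is what makes the retry transparent to the calling functionality.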

Bulkheads

Similar to compartments on a ship, bulkheads are intended to partition the functionality of software into sections that can fail individually without causing the overall application to become unresponsive. They prevent errors from cascading further, while the rest of the application stays functional.

In enterprise Java, the Bulkhead pattern is applied by defining multiple pools, such as database connection pools or thread pools. With multiple thread pools, we can ensure that specific parts of the application are not affected if another thread pool is currently exhausted.
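The isolation effect of separate thread pools can be seen in a small Java SE sketch. BulkheadDemo and its method are hypothetical names, and the two pools are created directly through the JDK Executors class purely for illustration. The single "write" thread is blocked on purpose, yet the "read" task still completes because it runs in its own pool.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical Java SE illustration of two bulkheads backed by separate thread pools.
class BulkheadDemo {

    // Submits a read task while the only write thread is stuck and returns the read result.
    static String readWhileWritesAreStuck() {
        ExecutorService readPool = Executors.newFixedThreadPool(1);
        ExecutorService writePool = Executors.newFixedThreadPool(1);
        CountDownLatch release = new CountDownLatch(1);
        try {
            // exhaust the write pool: its single thread waits until the latch is released
            writePool.submit(() -> {
                release.await();
                return null;
            });
            // the read pool is a separate compartment and is not affected
            Future<String> read = readPool.submit(() -> "instruments");
            return read.get(2, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException("read did not complete", e);
        } finally {
            release.countDown();
            readPool.shutdown();
            writePool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(readWhileWritesAreStuck()); // prints "instruments"
    }
}
```

With a single shared pool of one thread, the read task would have waited behind the blocked write task and run into the two-second limit.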

However, enterprise Java applications are not supposed to start or manage their own threads; rather, they must use platform functionalities to provide managed threads. For that purpose, Java EE ships with a ManagedExecutorService that provides container-managed threads, usually based on a single thread pool.

The Porcupine library, by Java EE expert Adam Bien, supports further defining container-managed thread pools that can be configured individually. The following shows the definition and usage of two dedicated ExecutorServices for retrieving and creating instruments:


@Inject
@Dedicated("instruments-read")
ExecutorService readExecutor;

@Inject
@Dedicated("instruments-write")
ExecutorService writeExecutor;


// usage within method body ...
CompletableFuture.supplyAsync(() -> instrumentCraftShop.getInstruments(), readExecutor);


// usage within method body ...
CompletableFuture.runAsync(() -> instrumentCraftShop.craftInstrument(instrument), writeExecutor);

In general, these executor services can be used from within our entire application. However, another circumstance needs to be considered when using HTTP resources.

Application servers usually make use of a single thread pool for the HTTP request threads that handle the incoming requests. Of course, a single request thread pool makes it difficult to construct multiple bulkheads in our application.

Therefore, we can use asynchronous JAX-RS resources to immediately hand off the handling of incoming requests to the dedicated executor services:


@Path("instruments")
@Consumes(MediaType.APPLICATION_JSON)
@Produces(MediaType.APPLICATION_JSON)
public class InstrumentsResource {

    @Inject
    InstrumentCraftShop instrumentCraftShop;

    @Inject
    @Dedicated("instruments-read")
    ExecutorService readExecutor;

    @Inject
    @Dedicated("instruments-write")
    ExecutorService writeExecutor;

    @GET
    public CompletionStage<List<Instrument>> getInstruments() {
        return CompletableFuture
                .supplyAsync(() -> instrumentCraftShop.getInstruments(), readExecutor);
    }

    @POST
    public CompletionStage<Response> createInstrument(@Valid @NotNull Instrument instrument) {
        return CompletableFuture.runAsync(
                () -> instrumentCraftShop.craftInstrument(instrument), writeExecutor)
                .thenApply(c -> Response.noContent().build())
                .exceptionally(e -> Response.status(Response.Status.INTERNAL_SERVER_ERROR)
                        .header("X-Error", e.getMessage())
                        .build());
    }
}

Incoming requests for both retrieving and creating instruments will be passed on to separate thread pools. Since JAX-RS 2.1, returning a CompletionStage or compatible type is sufficient to declare the JAX-RS resource as asynchronous. The request will be suspended and then resumed once the asynchronous processing has finished.

If one of these two functionalities, retrieving or creating instruments, becomes overloaded and its pool runs out of threads, the other one will be unaffected. The shared request thread pool is not likely to run out of threads, because the handling of requests is passed to other, managed threads immediately.
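The runAsync, thenApply, and exceptionally chain used in createInstrument can be tried out without a container. The following sketch uses plain strings in place of JAX-RS Response objects. AsyncChainDemo, handle, and rootCause are hypothetical names. One detail worth knowing is that a failure of an earlier stage reaches the exceptionally callback wrapped in a CompletionException, so calling getMessage() on it directly typically yields the cause's class name together with its message. The sketch unwraps the cause first.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for the asynchronous resource method, using strings instead of Responses.
class AsyncChainDemo {

    // Runs the task on the given executor and maps success and failure to a status line.
    static CompletableFuture<String> handle(Runnable task, Executor executor) {
        return CompletableFuture.runAsync(task, executor)
                .thenApply(v -> "204 No Content")
                .exceptionally(e -> "500 " + rootCause(e).getMessage());
    }

    // Failures of earlier stages arrive wrapped in a CompletionException.
    static Throwable rootCause(Throwable e) {
        return (e instanceof CompletionException && e.getCause() != null) ? e.getCause() : e;
    }

    public static void main(String[] args) {
        ExecutorService writeExecutor = Executors.newSingleThreadExecutor();
        try {
            System.out.println(handle(() -> { }, writeExecutor).join());
            System.out.println(handle(() -> {
                throw new IllegalStateException("printer offline");
            }, writeExecutor).join());
        } finally {
            writeExecutor.shutdown();
        }
    }
}
```

Note that an executor that rejects the task throws directly from runAsync, before any stage exists, so such a rejection surfaces in the calling thread rather than in the exceptionally callback.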

The reason that we use the Porcupine library instead of the Bulkhead functionality of MicroProfile Fault Tolerance is that the former gives us direct access to, and control over, the managed executor services that we connect with our asynchronous JAX-RS resources.

Backpressure

Applications that are under heavy load can apply backpressure by notifying clients about their current status. This can happen in several ways, for example, by adding metadata to responses or—more drastically—by returning failure responses.

We’ll want to implement backpressure if we consider it more important that our application stays responsive, and especially that it stays able to respond within its service level agreement (SLA), rather than delaying responses. In regard to meeting the SLA of the overall system, it might be more helpful to immediately respond with an error, so that clients can try a different application or another instance, rather than consuming all of the SLA time and still not being able to properly handle the request.

In order to instruct our executor service to immediately reject invocations that exceed the executor’s wait queue if all threads are busy, we need to further configure the behavior. ExecutorConfigurator is the managed bean that is used by Porcupine, which we can specialize using CDI:


@Specializes
public class CustomExecutorConfigurator extends ExecutorConfigurator {

    @Override
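    public ExecutorConfiguration forPipeline(String name) {
        // read pool: bounded wait queue, excess invocations are rejected
        if ("instruments-read".equals(name)) {
            return new ExecutorConfiguration.Builder()
                    .abortPolicy()
                    .queueCapacity(4)
                    .build();
        }

        // all other pools: abort policy with otherwise unchanged settings
        return new ExecutorConfiguration.Builder()
                .abortPolicy()
                .build();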
    }
}

The overridden method forPipeline is used to construct executor services for the qualified name. The abortPolicy invocations will instruct the underlying thread pools to immediately reject new invocations that exceed the resources with a RejectedExecutionException. This is the desired behavior for our purposes.
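The rejection behavior itself can be observed with the JDK's own ThreadPoolExecutor. The sketch below assumes that a bounded queue combined with an abort policy behaves like the configuration above. It does not use Porcupine, and AbortPolicyDemo and countRejected are hypothetical names. All submitted tasks block until the end, so the pool fills up deterministically.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical Java SE illustration of a bounded pool that rejects excess work.
class AbortPolicyDemo {

    // Submits 'tasks' blocking tasks to a pool with the given number of threads and
    // queue capacity, and returns how many submissions were rejected.
    static int countRejected(int threads, int queueCapacity, int tasks) {
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.AbortPolicy());
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        try {
            for (int i = 0; i < tasks; i++) {
                try {
                    executor.execute(() -> awaitQuietly(release));
                } catch (RejectedExecutionException e) {
                    // all threads are busy and the wait queue is full
                    rejected++;
                }
            }
        } finally {
            release.countDown();
            executor.shutdown();
        }
        return rejected;
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // 2 running + 4 queued tasks are accepted, the remaining 4 are rejected
        System.out.println(countRejected(2, 4, 10)); // prints 4
    }
}
```

The rejection happens in the submitting thread at the moment of submission, which is what allows a caller to react immediately instead of waiting in a queue.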

In order to notify the client about the unavailability of our application, we will map this exception to an HTTP 503 response:


@Provider
public class RejectedExecutionHandler implements ExceptionMapper<RejectedExecutionException> {

    @Override
    public Response toResponse(RejectedExecutionException exception) {
        return Response.status(Response.Status.SERVICE_UNAVAILABLE).build();
    }
}

If clients now invoke functionality that uses bulkheads that are currently under water, they immediately get an HTTP 503 response and might still be able to connect to another system within the SLA time.

However, which queue sizes should we configure in order to meet our SLA time and not overzealously reject requests that would be handled in time?

If we look at Little’s law in queueing theory, we’ll see that the mean response time is the mean number (of requests) in the system divided by the mean throughput. If we further take into account that we probably have a multiprocessing unit in the system, we can derive a maximum latency, as described in an article by Martin Thompson: maximum latency = (transaction time / number of threads) * (queue length), given a nonzero queue length and given that the maximum latency is greater than or equal to the transaction time. Transforming that formula leaves us with this: queue length = maximum latency / (transaction time / number of threads).

For example, suppose we want to guarantee an SLA time of 200 milliseconds (maximum latency) with a measured average transaction time of 20 milliseconds and four available threads. Applying this formula leaves us with a queue length of 40 (because 200/(20/4) equals 40). If the number of requests waiting in the queue exceeds this queue length, new requests will be rejected immediately, rather than after a waiting time of 200 milliseconds. This serves as a guideline for how to configure the individual ExecutorService definitions in our application.
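The transformed formula is easy to wrap in a small helper so that queue lengths can be recomputed whenever measurements change. QueueSizing and queueLength are hypothetical names. The result is rounded down, and the helper assumes, as stated for the formula, that the maximum latency is at least the transaction time.

```java
// Hypothetical helper that applies the queue length formula from the text.
class QueueSizing {

    // queue length = maximum latency / (transaction time / number of threads), rounded down
    static int queueLength(double maxLatencyMillis, double transactionTimeMillis, int threads) {
        if (transactionTimeMillis <= 0 || threads <= 0 || maxLatencyMillis < transactionTimeMillis) {
            throw new IllegalArgumentException(
                    "expects maxLatency >= transactionTime > 0 and threads > 0");
        }
        return (int) Math.floor(maxLatencyMillis / (transactionTimeMillis / threads));
    }

    public static void main(String[] args) {
        System.out.println(queueLength(200, 20, 4)); // prints 40
        System.out.println(queueLength(200, 20, 5)); // prints 50
    }
}
```

With a latency budget of 200 milliseconds and a transaction time of 20 milliseconds, four threads allow a queue length of 40 and five threads allow 50.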

Conclusion

There are a few things that need to be taken into consideration when running enterprise applications in production. We want to be sure that our application will survive the hard production life as well as possible, without cascading failures, system overloads, or liveness issues.

It’s possible to implement resiliency concerns, such as timeouts, retries, circuit breakers, bulkheads, and backpressure, as part of our applications by using plain enterprise Java and extensions thereof. The Java EE APIs already tackle a few concerns; the rest can be covered by extensions such as Porcupine and MicroProfile Fault Tolerance.

The second part of this article series will show how modern, cloud native technologies such as Istio and Kubernetes support running bulletproof, resilient enterprise applications.

About the Author

Sebastian Daschner is a self-employed Java consultant, author, and trainer and is enthusiastic about programming and enterprise Java. He is the author of the book Architecting Modern Java EE Applications and is participating in the JCP, helping to form the future standards of Java EE by serving in the JAX-RS, JSON-P, and Config Expert Groups and collaborating on various open source projects. For his contributions in the Java community and ecosystem he was recognized as a Java Champion, an Oracle Developer Champion, and a double 2016 JavaOne Rock Star. Besides Java, he is also a heavy user of Linux and cloud native technologies. He evangelizes computer science practices on his blog, in his newsletter, and on Twitter. When not working with Java, he also loves to travel the world—either by plane or motorbike.