Most of you have undoubtedly heard of the Extreme Programming (XP) practices introduced by Kent Beck. The widespread adoption of those lightweight practices has led to sounder design, more effective implementation and better software quality. In the same spirit, I’d like to suggest a set of practices that lend themselves to the design and implementation of robust integration solutions.
With today’s article, I’d like to begin with the challenge of designing resiliency into the integration solutions you build. Resiliency, or fault tolerance, is simply the ability of a system to keep its services highly available despite faults or exceptions. This oft-neglected aspect of enterprise integration (EI) system design is in fact essential to the operational integrity of the overall solution. EI is, quite literally, serious business. Most EI projects link together business-critical systems and enable the flow of business processes within the enterprise. Once implemented, the integration platform is itself business-critical, so its high availability is essential.
There are four principles to keep in mind when designing resilient EI systems:
Eliminating Single Points of Failure
Isolating Effects of Failure
Minimizing Moving Parts
Implementing Failover Mechanisms
Eliminating Single Points of Failure (SPoF)
For an EI system, the SPoF is the proverbial “weakest link” in the integration chain. Any component whose failure results in the failure of the overall system should be regarded as a SPoF. During the design phase of the project, make it a regular practice to conduct peer architecture review sessions specifically focusing on SPoFs in the system. The review should cover both the software component design and the physical deployment model. This is important because an EI system may appear sound from a software component design perspective but may be physically deployed in a manner that introduces significant points of failure.
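As a rough aid for such a review, the definition above can be turned into a simple check. The sketch below assumes the integration flow is modeled as a directed graph from a source system to a destination system; the component names and the function names (reachable, single_points_of_failure) are hypothetical and chosen only for illustration. A component is reported as a SPoF if removing it leaves no path from source to destination.

```python
from collections import deque


def reachable(graph, start, goal, removed=None):
    """Breadth-first search from start to goal, pretending `removed` has failed."""
    if removed in (start, goal):
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(graph, source, sink):
    """Return every intermediate component whose failure cuts source off from sink."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    return sorted(
        n for n in nodes - {source, sink}
        if not reachable(graph, source, sink, removed=n)
    )


# Hypothetical topology: two redundant brokers feeding one shared router.
flow = {
    "crm_adapter": ["broker_a", "broker_b"],
    "broker_a": ["router"],
    "broker_b": ["router"],
    "router": ["erp_adapter"],
}
spofs = single_points_of_failure(flow, "crm_adapter", "erp_adapter")
print(spofs)  # ['router']
```

To cover the physical deployment model as well, one option is to add hosts or network segments as nodes in the same graph, so that two redundant brokers deployed on one machine show up as depending on a single node.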
Isolating Effects of Failure
If a component fails, a well-designed system will minimize the effects of that failure and keep it from cascading to other parts of the system. Failure containment does not happen by accident; it has to be designed into the procedure each component follows. For instance, if an application adapter receives a message and writes it to a message queue, you want to ensure that the adapter holds a persisted copy of the message until the messaging system has acknowledged delivery to the destination queue. If the messaging system should crash before completing the delivery, the message is still persisted at the adapter, ready for retransmission.
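The adapter behavior described above can be sketched in a few lines. This is a minimal illustration, not a production design: the ForwardingAdapter class, the spool directory and the send callable are all hypothetical stand-ins for whatever adapter framework and messaging client you actually use. The assumption is that send returns True only once the messaging system has acknowledged delivery, and raises ConnectionError if the messaging system is unavailable.

```python
import json
import os
import uuid


class ForwardingAdapter:
    """Keeps a persisted copy of each message until delivery is acknowledged."""

    def __init__(self, spool_dir, send):
        # spool_dir: local directory holding unacknowledged messages (assumed)
        # send: callable(message) -> bool, True once delivery is acknowledged (assumed)
        self.spool_dir = spool_dir
        self.send = send
        os.makedirs(spool_dir, exist_ok=True)

    def _path(self, msg_id):
        return os.path.join(self.spool_dir, msg_id + ".json")

    def receive(self, message):
        """Persist the message first, then attempt delivery."""
        msg_id = uuid.uuid4().hex
        tmp = self._path(msg_id) + ".tmp"
        with open(tmp, "w") as f:
            json.dump(message, f)
            f.flush()
            os.fsync(f.fileno())  # make sure the copy is on disk
        os.replace(tmp, self._path(msg_id))  # atomic rename into the spool
        self._forward(msg_id)
        return msg_id

    def _forward(self, msg_id):
        """Send one spooled message; delete the copy only after an acknowledgement."""
        path = self._path(msg_id)
        with open(path) as f:
            message = json.load(f)
        try:
            acked = self.send(message)
        except ConnectionError:
            acked = False  # messaging system is down; keep the persisted copy
        if acked:
            os.remove(path)
        return acked

    def pending(self):
        """Ids of messages still waiting for an acknowledgement."""
        return sorted(
            name[:-len(".json")]
            for name in os.listdir(self.spool_dir)
            if name.endswith(".json")
        )

    def retry_pending(self):
        """Retransmit everything in the spool; return the ids that were acknowledged."""
        return [msg_id for msg_id in self.pending() if self._forward(msg_id)]
```

Calling retry_pending() on a timer, or when the adapter restarts, retransmits anything the messaging system never acknowledged. One consequence of this design is that a message can be delivered more than once if the acknowledgement itself is lost, so the receiving side should tolerate duplicates.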