Sidestepping Dependency Ordering with AppSwitch
We are going through an interesting cycle of application decomposition and recomposition. While the microservice paradigm is driving monolithic applications to be broken into separate individual services, the service mesh approach is helping connect them back together into well-structured applications. As such, microservices are logically separate but not independent. They are usually closely interdependent, and taking them apart introduces many new concerns such as the need for mutual authentication between services. Istio directly addresses most of those issues.
Dependency ordering problem
An issue that arises due to application decomposition and one that Istio doesn’t address is dependency ordering – bringing up individual services of an application in an order that guarantees that the application as a whole comes up quickly and correctly. In a monolithic application, with all its components built-in, dependency ordering between the components is enforced by internal locking mechanisms. But with individual services potentially scattered across the cluster in a service mesh, starting a service first requires checking that the services it depends on are up and available.
Dependency ordering is deceptively nuanced, with a host of interrelated problems. Ordering individual services requires having the dependency graph of the services so that they can be brought up starting from leaf nodes back to the root nodes. It is not easy to construct such a graph and keep it updated over time as interdependencies evolve with the behavior of the application. Even if the dependency graph is somehow provided, enforcing the ordering itself is not easy. Simply starting the services in the specified order obviously won't do. A service may have started but not be ready to accept connections yet. This is the problem with docker-compose's `depends_on` tag, for example.
Apart from introducing sufficiently long sleeps between service startups, a common pattern is to check for readiness of dependencies before starting a service. In Kubernetes, this could be done with a wait script as part of the init container of the pod. However, that means the entire application is held up until all its dependencies come alive. Sometimes applications spend several minutes initializing themselves on startup before making their first outbound connection. Not allowing a service to start at all adds substantial overhead to the overall startup time of the application. Also, the strategy of waiting in the init container won't work for multiple interdependent services within the same pod.
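To make the drawback concrete, here is a minimal sketch of that wait-for-dependency pattern as it might run from an init container. The target address, timeout and backoff are hypothetical; note that nothing else in the pod starts until this loop succeeds.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

func main() {
	// Hypothetical dependency address; in practice this would be the service
	// the pod's main container needs before it can start.
	const target = "dmgr.example.svc:9043"
	deadline := time.Now().Add(5 * time.Minute)

	for time.Now().Before(deadline) {
		conn, err := net.DialTimeout("tcp", target, 2*time.Second)
		if err == nil {
			conn.Close()
			fmt.Println("dependency is ready")
			return // the init container exits and the main container starts
		}
		time.Sleep(3 * time.Second) // not listening yet; retry
	}

	fmt.Fprintln(os.Stderr, "dependency never became ready")
	os.Exit(1) // the whole pod is held up; Kubernetes restarts the init container
}
```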
Example scenario: IBM WebSphere ND
Let us consider IBM WebSphere ND – a widely deployed application middleware – to grok these problems more closely. It is a fairly complex framework in itself and consists of a central component called the deployment manager (`dmgr`) that manages a set of node instances. It uses UDP to negotiate cluster membership among the nodes and requires that the deployment manager is up and operational before any of the node instances can come up and join the cluster.
Why are we talking about a traditional application in the modern cloud-native context? It turns out that there are significant gains to be had by enabling such applications to run on the Kubernetes and Istio platforms. Essentially, it's part of the modernization journey that allows traditional apps to run alongside green-field apps on the same modern platform, facilitating interoperation between the two. In fact, WebSphere ND is a demanding application. It expects a consistent network environment with specific network interface attributes, etc. AppSwitch is equipped to take care of those requirements. For the purpose of this blog, however, I'll focus on the dependency ordering requirement and how AppSwitch addresses it.
Simply deploying `dmgr` and node instances as pods on a Kubernetes cluster does not work. `dmgr` and the node instances happen to have a lengthy initialization process that can take several minutes. If they are all co-scheduled, the application typically ends up in a funny state. When a node instance comes up and finds that `dmgr` is missing, it takes an alternate startup path rather than exiting. If it had exited immediately, the Kubernetes crash-loop would have taken over and perhaps the application would have come up. But even in that case, it turns out that a timely startup is not guaranteed.
One `dmgr` along with its node instances is the basic deployment configuration for WebSphere ND. Applications built on top of WebSphere ND and run in production environments, such as IBM Business Process Manager, include several other services. In those configurations, there could be a chain of interdependencies. Depending on the applications hosted by the node instances, there may be an ordering requirement among them as well. With long service initialization times and crash-loop restarts, there is little chance for the application to start in any reasonable length of time.
Sidecar dependency in Istio
Istio itself is affected by a version of the dependency ordering problem. Since connections into and out of a service running under Istio are redirected through its sidecar proxy, an implicit dependency is created between the application service and its sidecar. Unless the sidecar is fully operational, all requests to and from the service are dropped.
Dependency ordering with AppSwitch
So how do we go about addressing these issues? One way is to defer the problem to the applications and say that they are supposed to be "well behaved" and implement appropriate logic to make themselves immune to startup order issues. However, many applications (especially traditional ones) either time out or deadlock if misordered. Even for new applications, implementing one-off logic for each service is a substantial additional burden that is best avoided. A service mesh needs to provide adequate support around these problems. After all, factoring out common patterns into an underlying framework is really the point of a service mesh.
AppSwitch explicitly addresses dependency ordering. It sits on the control path of the application's network interactions between clients and services in a cluster, and knows precisely when a service becomes a client by making the `connect` call and when a particular service becomes ready to accept connections by making the `listen` call. Its service router component disseminates information about these events across the cluster and arbitrates interactions among clients and servers. That is how AppSwitch implements functionality such as load balancing and isolation in a simple and efficient manner. Leveraging the same strategic location on the application's network control path, it is conceivable that the `connect` and `listen` calls made by those services can be lined up at a finer granularity, rather than coarsely sequencing entire services as per a dependency graph. That would effectively solve the multilevel dependency problem and speed up application startup.
But that still requires a dependency graph. A number of products and tools exist to help with discovering service dependencies. But they are typically based on passive monitoring of network traffic and cannot provide the information beforehand for any arbitrary application. Network-level obfuscation due to encryption and tunneling also makes them unreliable. The burden of discovering and specifying the dependencies ultimately falls to the developer or the operator of the application. Even consistency-checking a dependency specification is itself quite complex, so any way to avoid requiring a dependency graph at all would be most desirable.
The point of a dependency graph is to know which clients depend on a particular service so that those clients can then be made to wait for the respective service to become live. But does it really matter which specific clients? Ultimately one tautology that always holds is that all clients of a service have an implicit dependency on the service. That’s what AppSwitch leverages to get around the requirement. In fact, that sidesteps dependency ordering altogether. All services of the application can be co-scheduled without regard to any startup order. Interdependencies among them automatically work themselves out at the granularity of individual requests and responses, resulting in quick and correct application startups.
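To make that idea concrete, here is a conceptual sketch (not AppSwitch's actual implementation) of how a service router can arbitrate startup order at the granularity of individual calls: a client's `connect` is simply parked until the corresponding `listen` is observed, so no dependency graph or explicit startup order is needed.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// router tracks, per service name, whether a listen() has been observed.
type router struct {
	mu    sync.Mutex
	ready map[string]chan struct{} // closed once the named service listens
}

func (r *router) channel(svc string) chan struct{} {
	r.mu.Lock()
	defer r.mu.Unlock()
	ch, ok := r.ready[svc]
	if !ok {
		ch = make(chan struct{})
		r.ready[svc] = ch
	}
	return ch
}

// OnListen records that svc has made its listen() call.
func (r *router) OnListen(svc string) { close(r.channel(svc)) }

// OnConnect parks a client's connect() to svc until the target has listened,
// or fails after a timeout so crash-loop style recovery can take over.
func (r *router) OnConnect(svc string, timeout time.Duration) error {
	select {
	case <-r.channel(svc):
		return nil
	case <-time.After(timeout):
		return fmt.Errorf("%s did not come up in time", svc)
	}
}

func main() {
	r := &router{ready: make(map[string]chan struct{})}

	// The server side starts late, as a slow-initializing service would.
	go func() {
		time.Sleep(2 * time.Second)
		r.OnListen("dmgr")
	}()

	// The client can be co-scheduled without regard to order: its connect
	// simply waits for the matching listen event.
	if err := r.OnConnect("dmgr", 10*time.Second); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("connect proceeds once dmgr is listening")
}
```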
AppSwitch model and constructs
Now that we have a conceptual understanding of AppSwitch's high-level approach, let's look at the constructs involved. But first, a quick summary of the usage model is in order. Even though it was written for a different context, reviewing my earlier blog on this topic would be useful as well. For completeness, let me also note that AppSwitch doesn't bother with non-network dependencies. For example, it may be possible for two services to interact using IPC mechanisms or through the shared file system. Processes with deep ties like that are typically part of the same service anyway and don't require the framework's intervention for ordering.
At its core, AppSwitch is built on a mechanism that allows instrumenting the BSD socket API and other related calls like `fcntl` and `ioctl` that deal with sockets. As interesting as the details of its implementation are, they would distract us from the main topic, so I'll just summarize the key properties that distinguish it from other implementations. (1) It's fast. It uses a combination of `seccomp` filtering and binary instrumentation to aggressively limit intervening with the application's normal execution. AppSwitch is particularly suited for service mesh and application networking use cases given that it implements those features without ever having to actually touch the data. In contrast, network-level approaches incur a per-packet cost. Take a look at this blog for some of the performance measurements. (2) It doesn't require any kernel support, kernel module or patch, and works on standard distro kernels. (3) It can run as a regular user (no root). In fact, the mechanism can even make it possible to run the Docker daemon without root by removing the root requirement for networking containers. (4) It doesn't require any changes to the applications whatsoever and works for any type of application – from WebSphere ND and SAP to custom C apps to statically linked Go apps. The only requirement at this point is Linux/x86.
Decoupling services from their references
AppSwitch is built on the fundamental premise that applications should be decoupled from their references. The identity of applications is traditionally derived from the identity of the host on which they run. However, applications and hosts are very different objects that need to be referenced independently. Detailed discussion around this topic along with a conceptual foundation of AppSwitch is presented in this research paper.
The central AppSwitch construct that achieves the decoupling between service objects and their identities is the service reference (reference, for short). AppSwitch implements service references based on the API instrumentation mechanism outlined above. A service reference consists of an IP:port pair (and optionally a DNS name) and a label-selector that selects the service represented by the reference and the clients to which this reference applies. A reference supports a few key properties. (1) It can be named independently of the name of the object it refers to. That is, a service may be listening on an IP and port, but a reference allows that service to be reached on any other IP and port chosen by the user. This is what allows AppSwitch to run traditional applications, captured from their source environments with static IP configurations, on Kubernetes, providing them with the necessary IP addresses and ports regardless of the target network environment. (2) It remains unchanged even if the location of the target service changes. A reference automatically redirects itself as its label-selector resolves to the new instance of the service. (3) Most important for this discussion, a reference remains valid even as the target service is coming up.
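As an illustration of what a reference carries, here is a hypothetical sketch in Go; the type and field names are invented for this example and are not AppSwitch's actual API.

```go
package main

import "fmt"

// ServiceReference names a service independently of where it actually listens.
// The type and field names are invented for this illustration.
type ServiceReference struct {
	Name          string            // optional DNS name clients may use
	IP            string            // IP address clients connect to, independent of the real binding
	Port          int               // port clients connect to
	LabelSelector map[string]string // selects the target service and the clients this reference applies to
}

func main() {
	// A traditional client expects dmgr at a fixed, static address; the
	// reference provides that address regardless of where dmgr really runs.
	ref := ServiceReference{
		Name:          "dmgr.example.internal",
		IP:            "10.0.0.10",
		Port:          9043,
		LabelSelector: map[string]string{"app": "websphere-nd", "role": "dmgr"},
	}
	fmt.Printf("clients reach %s (%s:%d) wherever the target currently runs\n", ref.Name, ref.IP, ref.Port)
}
```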
To facilitate discovering services that can be accessed through service references, AppSwitch provides an auto-curated service registry. The registry is automatically kept up to date as services come and go across the cluster, based on the network API calls that AppSwitch tracks. Each entry in the registry consists of the IP and port where the respective service is bound. Along with that, it includes a set of labels indicating the application to which this service belongs, the IP and port that the application passed through the socket API when creating the service, the IP and port where AppSwitch actually bound the service on the underlying host on behalf of the application, etc. In addition, applications created under AppSwitch carry a set of labels passed by the user that describe the application, together with a few default system labels indicating the user that created the application, the host where the application is running, etc. These labels are all available to be expressed in the label-selector carried by a service reference. A service in the registry can be made accessible to clients by creating a service reference. A client would then be able to reach the service at the reference's name (IP:port). Now let's look at how AppSwitch guarantees that the reference remains valid even when the target service has not yet come up.
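For illustration, a registry entry as described above might carry information along these lines; again, the type and field names are invented for this sketch and do not reflect AppSwitch's actual data model.

```go
package main

import "fmt"

// RegistryEntry sketches the information described above for one registry
// entry; the field names are invented for this illustration.
type RegistryEntry struct {
	ServiceAddr  string            // IP:port where the service is reachable
	AppBindAddr  string            // IP:port the application passed through the socket API
	HostBindAddr string            // IP:port where AppSwitch actually bound the socket on the host
	Labels       map[string]string // application, user and host labels usable in reference selectors
}

func main() {
	entry := RegistryEntry{
		ServiceAddr:  "10.0.0.10:9043",
		AppBindAddr:  "0.0.0.0:9043",
		HostBindAddr: "192.168.1.5:31043",
		Labels: map[string]string{
			"app":  "websphere-nd",
			"role": "dmgr",
			"user": "ops",    // default system label: the user that created the application
			"host": "node-1", // default system label: the host where it runs
		},
	}
	fmt.Printf("%s selectable via labels %v\n", entry.ServiceAddr, entry.Labels)
}
```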
Non-blocking requests
AppSwitch leverages the semantics of the BSD socket API to ensure that service references appear valid from the perspective of clients as the corresponding services come up. When a client makes a blocking `connect` call to another service that has not yet come up, AppSwitch blocks the call for a certain time, waiting for the target service to become live. Since it is known that the target service is a part of the application and is expected to come up shortly, making the client block rather than returning an error such as `ECONNREFUSED` prevents the application from failing to start. If the service doesn't come up in time, an error is returned to the application so that framework-level mechanisms like the Kubernetes crash-loop can kick in.
If the client request is marked as non-blocking, AppSwitch handles that by returning `EAGAIN` to inform the application to retry rather than give up. Once again, that is in line with the semantics of the socket API and prevents failures due to startup races. AppSwitch essentially enables the retry logic already built into applications in support of the BSD socket API to be transparently repurposed for dependency ordering.
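Here is a sketch of the kind of retry loop many applications already have around a non-blocking connect, written against the raw socket API in Go; the address, port and backoff are illustrative. It is this existing logic that gets repurposed when `EAGAIN` is returned for a service that hasn't come up yet.

```go
package main

import (
	"fmt"
	"syscall"
	"time"
)

// connectWithRetry attempts a non-blocking TCP connect and retries on the
// errors a not-yet-ready service would produce.
func connectWithRetry(addr [4]byte, port, attempts int) error {
	for i := 0; i < attempts; i++ {
		fd, err := syscall.Socket(syscall.AF_INET, syscall.SOCK_STREAM|syscall.SOCK_NONBLOCK, 0)
		if err != nil {
			return err
		}
		err = syscall.Connect(fd, &syscall.SockaddrInet4{Port: port, Addr: addr})
		switch err {
		case nil, syscall.EINPROGRESS:
			// Connection is underway; a real application would now poll the
			// socket for writability before using it.
			syscall.Close(fd)
			return nil
		case syscall.EAGAIN, syscall.ECONNREFUSED:
			// Not ready yet: back off and try again. This is the existing
			// retry path that doubles as dependency ordering while the
			// target service is still coming up.
			syscall.Close(fd)
			time.Sleep(time.Second)
		default:
			syscall.Close(fd)
			return err
		}
	}
	return fmt.Errorf("service did not become reachable")
}

func main() {
	// Hypothetical target: a dmgr-like service on localhost port 9043.
	if err := connectWithRetry([4]byte{127, 0, 0, 1}, 9043, 5); err != nil {
		fmt.Println(err)
	}
}
```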
Application timeouts
What if the application times out based on its own internal timer? Truth be told, AppSwitch could also fake the application's perception of time if desired, but that would be overstepping and is actually unnecessary. The application decides, and knows best, how long it should wait, and it's not appropriate for AppSwitch to mess with that. Application timeouts are conservatively long, and if the target service still hasn't come up in time, it is unlikely to be a dependency ordering issue. There must be something else going on that should not be masked.
Wildcard service references for sidecar dependency
Service references can be used to address the Istio sidecar dependency issue mentioned earlier. AppSwitch allows the IP:port specified as part of a service reference to be a wildcard. That is, the service reference IP address can be a netmask indicating the range of IP addresses to be captured. If the label selector of the service reference points to the sidecar service, then all outgoing connections of any application to which this service reference is applied will be transparently redirected to the sidecar. And of course, the service reference remains valid while the sidecar is still coming up, so the race is removed.
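Continuing the earlier hypothetical sketch, a wildcard reference for the sidecar case might look like this; the names are again invented for illustration and are not AppSwitch's actual API.

```go
package main

import "fmt"

// ServiceReference is the same illustrative type as in the earlier sketch.
type ServiceReference struct {
	IP            string            // a CIDR here acts as a wildcard over destination addresses
	Port          int               // 0 standing in for "any port" in this sketch
	LabelSelector map[string]string // selects the service the reference resolves to
}

func main() {
	// Capture every outbound destination of the selected clients and redirect
	// it to the sidecar; the reference stays valid while the sidecar starts.
	sidecarRef := ServiceReference{
		IP:            "0.0.0.0/0",
		Port:          0,
		LabelSelector: map[string]string{"role": "istio-sidecar"},
	}
	fmt.Printf("wildcard %s:%d -> %v\n", sidecarRef.IP, sidecarRef.Port, sidecarRef.LabelSelector)
}
```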
Using service references for sidecar dependency ordering also implicitly redirects the application's connections to the sidecar without requiring iptables and the attendant privilege issues. Essentially it works as if the application is directly making connections to the sidecar rather than to the target destination, leaving the sidecar in charge of what to do. AppSwitch would inject metadata about the original destination, etc., into the data stream of the connection using the proxy protocol, which the sidecar could decode before passing the connection through to the application. Some of these details were discussed here. That takes care of outbound connections, but what about incoming connections? With all services and their sidecars running under AppSwitch, any incoming connections coming from remote nodes would already have been redirected through their respective remote sidecars, so there is nothing special to do about incoming connections.
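For reference, the proxy protocol mentioned above conveys the original connection endpoints as a small preamble written ahead of the application data. Below is a sketch of its human-readable v1 form with made-up addresses; it shows the kind of metadata a sidecar could decode, not AppSwitch's actual wire format.

```go
package main

import "fmt"

// proxyV1Header builds the human-readable PROXY protocol v1 preamble:
// "PROXY TCP4 <src-ip> <dst-ip> <src-port> <dst-port>\r\n".
func proxyV1Header(srcIP, dstIP string, srcPort, dstPort int) string {
	return fmt.Sprintf("PROXY TCP4 %s %s %d %d\r\n", srcIP, dstIP, srcPort, dstPort)
}

func main() {
	// The client actually dialed 10.0.0.10:9043; writing this line ahead of
	// the application data lets the sidecar recover the original destination.
	fmt.Print(proxyV1Header("10.1.2.3", "10.0.0.10", 52100, 9043))
}
```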
Summary
Dependency ordering is a pesky problem. This is mostly due to the lack of access to fine-grain application-level events around inter-service interactions. Addressing this problem would normally have required applications to implement their own internal logic. But AppSwitch makes it possible to instrument those internal application events without requiring application changes. It then leverages the ubiquitous support for the BSD socket API to sidestep the requirement of ordering dependencies.
Acknowledgements
Thanks to Eric Herness and team for their insights and support with IBM WebSphere and BPM products as we modernized them onto the Kubernetes platform and to Mandar Jog, Martin Taillefer and Shriram Rajagopalan for reviewing early drafts of this blog.