Non-IBM Disclaimer

The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions.

Monday, June 27, 2022

API Connect Operator refactoring or how to move a cluster level Operator to a specific namespace

A customer had a single OpenShift cluster running three API Connect instances: DEV, UAT, and STAGING. The Operators were installed at the cluster level and centrally managed all three instances. The customer wanted to upgrade the APIC instances independently and at their own pace, i.e., upgrade DEV and test it, then move on to UAT, and finally to STAGING. With the central Operators, however, that was not possible.

The Operator-based API Connect upgrade process consists of upgrading the Operator and then upgrading the Operand. The Operator and the Operand are required to be at the same version, which implies that the Operand should be upgraded right after the Operator, without any major delay. Having a central Operator therefore required the customer to upgrade all three Operands at the same time. We needed to decouple the Operands.

Long story short, after analyzing the underlying limitations, the OpenShift architecture, and the technical aspects of the possible solutions, we decided to decentralize the Operator. Instead of a single, central, cluster-level Operator, the proposed to-be architecture was based on separate Operators installed in each Operand namespace. This approach was not a perfect solution, as it did not remove all of the restrictions. For example, the ibm-common-services and API Connect CRDs remained cluster-level resources. However, this is a supported topology, and it brought the customer much closer to their goal of upgrading each API Connect instance separately, as long as the cluster-level resources do not change or any change to them is backwards compatible.

The operational part was straightforward: uninstall the central Operators and install them in each Operand namespace. Each new Operator auto-discovered its local Operand and took control of the resources. Most pods, including the gateway pods, were restarted. After about an hour the whole instance was healthy. Best of all, there was no need to reinstall the API Connect instance. There are now three sets of Operators, one in each Operand namespace, which allows each Operand to be maintained separately from the others.
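
To make the per-namespace installation more concrete, here is a minimal sketch, not the exact procedure used with the customer. It assumes the Operators are installed through the Operator Lifecycle Manager (OLM), where an OperatorGroup with a single target namespace scopes an Operator to that namespace and a Subscription installs it there. The namespace names (apic-dev, apic-uat, apic-staging), the package name, the channel, the catalog source, and the helper function itself are illustrative assumptions; check them against your own cluster and the API Connect documentation before using anything like this. Manual install plan approval is also my assumption, chosen so that an Operator in one namespace is not upgraded before you are ready.

```python
import json


def namespace_scoped_manifests(namespace, package, channel,
                               source="ibm-operator-catalog",
                               source_namespace="openshift-marketplace"):
    """Build OLM manifests that install one Operator into a single namespace.

    Hypothetical helper: the default catalog source values are assumptions.
    """
    # An OperatorGroup that targets only its own namespace makes the
    # Operator installed there watch that namespace and nothing else.
    operator_group = {
        "apiVersion": "operators.coreos.com/v1",
        "kind": "OperatorGroup",
        "metadata": {"name": f"{namespace}-og", "namespace": namespace},
        "spec": {"targetNamespaces": [namespace]},
    }
    # The Subscription lives in the same namespace as the Operand.
    subscription = {
        "apiVersion": "operators.coreos.com/v1alpha1",
        "kind": "Subscription",
        "metadata": {"name": package, "namespace": namespace},
        "spec": {
            "name": package,
            "channel": channel,
            "source": source,
            "sourceNamespace": source_namespace,
            # Assumption: approve upgrades by hand, one namespace at a time.
            "installPlanApproval": "Manual",
        },
    }
    return [operator_group, subscription]


if __name__ == "__main__":
    # Placeholder values; replace with the real namespaces, package, and channel.
    for ns in ("apic-dev", "apic-uat", "apic-staging"):
        bundle = {
            "apiVersion": "v1",
            "kind": "List",
            "items": namespace_scoped_manifests(ns, "ibm-apiconnect", "v0.0"),
        }
        with open(f"{ns}-operator.json", "w") as fh:
            json.dump(bundle, fh, indent=2)
        print(f"wrote {ns}-operator.json")
```

Each generated file could then be reviewed and applied with `oc apply -f <file>`. Keeping one file per namespace mirrors the goal of the exercise: each Operand gets its own Operator and its own upgrade schedule.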
