
How to Fix the Top 10 Kubernetes Problems Technology Leaders Face In An Enterprise
Aug 16
9 min read

Kubernetes stands as a remarkable innovation in Cloud Computing. It enables enterprises to scale efficiently to meet their expanding business needs. Its self-healing capability allows businesses to operate mission-critical applications around the clock. Additionally, zero downtime deployment through rollouts and rollbacks ensures a smooth customer experience.
There is no problem with the Kubernetes technology itself. The problem lies in how it is adopted, operationalized, and architected by its users. In this article we discuss the top 10 Kubernetes problems technology leaders face in an enterprise, and I will also share some strategies to address them.
Disclaimer: This article is not associated with, nor does it represent, any product, service, technology, or organization I am affiliated with. It solely represents my own thoughts, experiences, and opinions. Any resemblance to real information, events, or scenarios is purely coincidental.
Unwanted Deployment of New Container Image to Kubernetes Production
When a continuous deployment pipeline is built by a novice engineer using a technology like Jenkins or ArgoCD, and pods or containers are deleted or restarted in the Kubernetes cluster during troubleshooting or a failed deployment, the pipeline's default behavior is to automatically sync the pods and containers with whatever latest image is available in the image registry, which may not be the same version as the original deployment. As a result, the running image is replaced by a newer one that may not be ready for release, pushing unwanted changes to production.
To avoid this issue, the pipeline should be made foolproof by designing it to be version-aware, ensuring it only retrieves specific image versions from the registry instead of relying on the latest tag. Additionally, at the end of the continuous deployment pipeline, an automated approval request should be sent to stakeholders to validate the deployment. Approvers can then reject or accept a release at the click of a button. These two strategies help prevent unintended deployments to production. Moreover, depending on how the CD portion of the pipeline is designed, image auto-syncing can be turned off entirely; a sketch of what this can look like with ArgoCD is shown below.
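As a minimal sketch, assuming ArgoCD is the CD tool (the application name, repository URL, namespaces, and image tag below are hypothetical), the Application can pin its source to a specific revision and omit the automated sync policy, while the Deployment it manages references an explicit image tag rather than latest:

```yaml
# ArgoCD Application that deploys only what is explicitly approved.
# Omitting spec.syncPolicy.automated means syncs are triggered manually,
# so a deleted or restarted pod will not pull in a newer, unapproved image.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-service          # hypothetical name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/payments-deploy.git  # hypothetical repo
    targetRevision: v1.4.2        # pinned Git tag, not HEAD
    path: overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
---
# In the manifests themselves, reference an immutable image tag (or digest),
# never "latest".
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-service
  namespace: payments
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments-service
  template:
    metadata:
      labels:
        app: payments-service
    spec:
      containers:
        - name: payments-service
          image: registry.example.com/payments-service:1.4.2   # pinned version
```

With no automated sync, a recreated pod comes back from the manifests of the approved revision, not from whatever image happened to be pushed to the registry most recently.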
Catching Up with the Latest Kubernetes Version
Kubernetes releases a new version three times per year. Often, enterprise companies play catch-up. In my experience, enterprises tend to lag five to ten versions behind. Those that run Kubernetes as a managed service in the cloud are not spared either: only master nodes are upgraded by the provider, and when worker nodes become outdated, some cloud providers charge for Extended Support. It's not that enterprises don't want to upgrade to the latest version, but the internal logistics required to upgrade a Kubernetes cluster work against the clock.
A good strategy is to make Kubernetes upgrades a pre-approved change and get the tenants' concurrence on an agreed schedule. Then make the upgrade implementation plan a standard so it is repeatable in the future. Lastly, Kubernetes involves different stakeholders, not just IT, so create a clear and simple RACI matrix to ensure that accountability and approvals during changes are clearly defined.
End-of-Life Kubernetes Version
This problem occurs when an organization fails to play catch-up and gets stuck with an outdated version of Kubernetes. Another situation is when a tech team inherits a Kubernetes platform from a vendor, or as a remnant of the past. Problems arise when the apps running inside the cluster are not compatible with the latest version, or when there are so many undocumented configurations that nodes or pods fail to start after a reboot during an upgrade. To make the situation worse, there is no rollback for Kubernetes version upgrades; therefore, an in-place upgrade is not recommended in this situation.
To fix this problem, the best strategy is to migrate the apps from the old Kubernetes cluster to a newly built environment running the latest version, properly architected and documented. With this approach, if there is a compatibility issue, live apps in production are not affected, because the migration itself becomes the testing ground for version compatibility. The lack of rollback is not a concern either, since the current apps keep running in the old environment in parallel with the migration. Downtime is mitigated as well, because after the migration the cutover is just a DNS switch to the new environment. Ensure that a solid migration playbook is created for each migration. Buy-in from the different stakeholders is crucial for this approach to succeed.
Treating Kubernetes Like a Virtual Machine
I've witnessed instances in the industry where someone tries to port a monolithic application to Kubernetes just to claim being cloud native. As a result, apps run in a single pod like a monolithic virtual machine, or their behavior becomes abnormal. I've also seen cases where microservices communicate with each other by traversing the internet instead of leveraging Kubernetes' software-defined network, causing latency rather than fast pod-to-pod communication. I've seen Information Security insisting on installing antivirus software on Kubernetes pods, even though they are immutable by nature. Persistent volumes in self-hosted Kubernetes get mounted on a single node rather than distributed across the cluster. Due to a lack of familiarity and skills, Kubernetes ends up being treated like a virtual machine. As a result, the enterprise does itself more harm than good with Kubernetes, the technology investment is not maximized, and the expectation of improved performance is met instead with latency and downtime.
There are two strategies to solve the problems mentioned above. Enterprises with sufficient funding should hire a third-party provider. The provider should be able to guide the enterprise with a clear Kubernetes adoption framework covering scoping, discovery, planning, architecture, build, workload onboarding, operationalization, training, and handover. Avoid partners who are just there to install Kubernetes. Those with limited funding should go the route of a "people first before cluster" strategy. Be willing to do it right from the onset. Hire external talent to bring discipline and new practices to the organization. Put your best engineers forward, and allow them to experiment and fail at the start. Avoid the delusion that a "legacy engineer" who changes their title to DevOps will run your Kubernetes cluster like a pro. Either invest in training existing engineers or bring in new, talented ones to reinforce them.
Improper Kubernetes Operating Model and Governance
Many enterprises started with Kubernetes as a science project. Then engineers began to onboard some workloads for a POC. Other enterprises had their developers turn a Kubernetes development cluster into production, or worse, a POC became production. As time goes by, everyone has access to the Kubernetes production environment, whether vendors, developers, infrastructure, or end users. It is just a matter of time before downtime occurs. Unstable production is a byproduct of such practices, resulting in disparity between environments, inconsistent configuration, unauthorized changes, or even security breaches, such as certificates and keys getting published in a public repository. All of these lead to downtime or, worse, a compromise. These are the worst nightmares for a technology leader.
A multi-step strategy is needed to prevent this from happening. First, define a RACI matrix and the roles and responsibilities of each stakeholder in your Kubernetes clusters and DevOps practice. A RACI matrix is a document that lists the key critical activities and the people who are responsible, accountable, consulted, and informed for each activity. Next, build an Access Role Matrix; this document maps the ownership of the different stakeholders across the various Kubernetes clusters (a sketch of how this can be enforced in-cluster is shown below). Create a build and release process that outlines the steps for getting changes live from code to production. Once you have all of these, create a CI/CD pipeline to automate releases to production. Use the RACI matrix in building and operationalizing the pipeline.
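As a minimal sketch of enforcing the Access Role Matrix inside the cluster itself, Kubernetes RBAC can translate those documented roles into permissions (the group, role, and namespace names below are hypothetical):

```yaml
# Read-only access to the production namespace for the application team;
# write access would be granted separately, e.g. to the release pipeline's
# service account, so humans cannot push unauthorized changes.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prod-read-only
  namespace: production
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "services", "deployments", "replicasets"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-team-read-only
  namespace: production
subjects:
  - kind: Group
    name: app-team                 # hypothetical identity-provider group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: prod-read-only
  apiGroup: rbac.authorization.k8s.io
```

Because the team's binding grants only read verbs, day-to-day troubleshooting remains possible without the ability to change or delete production workloads.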
Accumulated Runtime Vulnerabilities of Applications in Kubernetes
The typical deployment process of an app to Kubernetes subjects the code to vulnerability scanning. More mature organizations also perform image scanning. All of this happens during the pre-deployment stage. Once the app is deployed to Kubernetes, traditional enterprises don't have the capability to detect runtime security vulnerabilities. Thus, the application's stakeholders are surprised when their next release is full of vulnerabilities. This is more rampant in apps that do not release frequently. Releases are then delayed due to security compliance issues, or worse, vulnerabilities remain undetected. This happens because vulnerabilities evolve in real time; what was secure in the past may not be secure the next day.
To solve this problem, an enterprise must implement runtime security for Kubernetes and containers. A good example of a runtime security platform is SysDig. Apart from runtime security, SysDig is an agentic security platform built for Kubernetes that can autonomously make decisions. Security must work at the pod, container, and network layers of Kubernetes. As soon as vulnerabilities are detected at runtime, engineers can remediate them immediately. The runtime security tooling must be integrated with the pipeline to trigger the deployment process once a fix is available; SysDig can even point to the exact fix for a vulnerability. An open-source sketch of a runtime detection rule is shown below.
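For teams that want to see what runtime detection looks like, Falco, the open-source runtime security engine originally created by Sysdig, expresses detections as YAML rules. The rule below is a minimal, illustrative sketch rather than a production policy:

```yaml
# Falco rule: alert when an interactive shell is spawned inside a container,
# a common sign of tampering or of an attacker exploring a compromised pod.
- rule: Shell Spawned in Container
  desc: Detect a shell process started inside a running container
  condition: >
    evt.type = execve and evt.dir=< and
    container.id != host and
    proc.name in (bash, sh, zsh)
  output: >
    Shell spawned in container
    (user=%user.name container=%container.name
    image=%container.image.repository command=%proc.cmdline)
  priority: WARNING
  tags: [container, shell, runtime]
```

Alerts like this can be forwarded to the pipeline or chat tooling so engineers learn about suspicious runtime behavior as it happens, not at the next release.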
Expiring Certificates
This problem usually starts with a mundane routine like a Kubernetes upgrade or an OS node update. Once the activity is complete, an unsuspecting engineer reboots the node to have the patches take effect. All of a sudden the cluster won't come back up, or nodes go NotReady and pods get stuck in a Pending state. The entire engineering team begins to investigate, only to find that the root cause is an expired Kubernetes certificate. Kubernetes clusters utilize TLS certificates for secure communication between various components, such as the API server, kubelets, and etcd. By default, these certificates, especially those managed by kubeadm, have a validity period of one year. If these certificates expire, the cluster will experience communication failures, leading to:
Inability to connect to the Kubernetes API server using kubectl.
Failure of pods to start or communicate correctly.
Disruption of cluster operations and potential loss of functionality.
Technology leaders have several ways to deal with expiring certificates, as follows:
10-Year Validity - proactively replace existing certificates with certificates valid for 10 years.
Regular Kubernetes Upgrades - kubeadm can automatically renew certificates during upgrades.
Monitoring Tools - put a monitoring tool such as SysDig or Dynatrace in place to be proactively notified before certificates expire. Open-source tools like Prometheus and Grafana, or custom scripts, can be configured to achieve the same.
Use cert-manager in Clusters - in clusters that utilize cert-manager, it can be configured to automatically manage and renew certificates within the cluster (see the sketch after this list).
Regular Certificate Checks - make certificate checks part of the engineers' monitoring routine, using the openssl command or a custom script.
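As a minimal sketch of the cert-manager option (the certificate name, namespace, DNS name, and issuer below are hypothetical), a Certificate resource asks cert-manager to issue a TLS certificate and renew it well before it expires:

```yaml
# cert-manager watches this resource, issues the certificate into the named
# Secret, and renews it automatically once the renewBefore window is reached.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: internal-api-tls           # hypothetical name
  namespace: platform
spec:
  secretName: internal-api-tls     # Secret where the key pair is stored
  duration: 2160h                  # 90 days
  renewBefore: 360h                # renew 15 days before expiry
  dnsNames:
    - api.internal.example.com     # hypothetical DNS name
  issuerRef:
    name: internal-ca-issuer       # hypothetical ClusterIssuer
    kind: ClusterIssuer
```

For the control-plane certificates that kubeadm itself manages, kubeadm certs check-expiration gives a quick view of what is about to expire, which pairs well with the regular-check routine above.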
Decentralized Kubernetes Clusters
Decentralized Kubernetes clusters occur when large organizations, each with various business units, operate numerous Kubernetes clusters. Each business unit offers distinct products and services to both internal and external customers, hosting these offerings as microservices, frontends, and backends across at least three environments. Consequently, they might manage around 100 clusters, leading to operational challenges in IT. Engineers perform numerous maintenance tasks in Kubernetes, including disk optimization, applying patches, fixing vulnerabilities, handling upgrades, and managing certificates. This workload increases with every new Kubernetes cluster that is established.
To address this type of issue, a technology leader should adopt a two-step strategy. The first step involves consolidating Kubernetes clusters where it is permissible to do so. For instance, merging the Development, Quality Assurance, and UAT environments can greatly reduce the overall footprint of your Kubernetes clusters. You also have the option to combine pre-production and production. The second step is to either set up new Kubernetes clusters on, or integrate existing ones into, a centralized container management platform such as Rancher. This centralized platform offers a unified interface to govern, manage, upgrade, monitor, and maintain Kubernetes clusters.
Chargeback to Business Unit
Kubernetes supports multi-tenancy for applications, allowing applications from various business units to coexist with logical separation via namespaces and node groups. This benefits enterprises by maximizing infrastructure investments through economies of scale and allocating resources to more utilized pods. The challenge lies in determining how to charge back the CAPEX and OPEX of operating Kubernetes clusters to the business units, particularly since resource consumption varies across namespaces for different business units. Dividing costs equally among business units would be unfair, especially for those using fewer resources in the cluster.
A technology leader must invest in a specialized tool capable of aggregating the cost consumed through tagging and namespace grouping (a sketch of such tagging is shown below). The tool should compute the cost consumption of business units at the pod, infrastructure, and network layers while producing a comprehensive cost summary. This is not your typical Cost and Usage Report in AWS; leaders need a tool like CloudHealth or SysDig Monitor. These tools provide predictable cost analysis and savings estimates for Kubernetes.
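Whichever tool is chosen, chargeback only works if workloads are attributable to a business unit in the first place. A minimal sketch, assuming namespaces are labelled for cost attribution (the namespace name and label values below are hypothetical):

```yaml
# Namespace labelled with cost-attribution metadata; cost tools can then
# group pod, storage, and network spend by these labels instead of splitting
# the bill evenly across business units.
apiVersion: v1
kind: Namespace
metadata:
  name: payments-prod                  # hypothetical namespace
  labels:
    business-unit: consumer-banking    # hypothetical label values
    cost-center: cc-1042
    environment: production
```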
Kubernetes Performance Tuning
Consider an enterprise that handles millions of customer transactions via its booking platform, e-commerce site, or payment processing. If this system suddenly becomes unresponsive or experiences transaction timeouts, the company loses revenue for each hour of downtime spent troubleshooting, only to discover that the issue was PostgreSQL hosted on Kubernetes slowing down because of an unoptimized persistent volume. To maintain optimal performance, enterprises dedicate approximately 30% of their engineering effort to fine-tuning Kubernetes, particularly in self-hosted clusters. This represents significant overhead and requires specialized skills. Beyond maintenance, Kubernetes performance issues add pressure on top technology executives during C-level meetings. Nonetheless, there are four basic strategies that technology leaders can employ to tackle performance tuning challenges in Kubernetes clusters.
Managed Service Kubernetes Cluster - if you can host Kubernetes as a managed service in the public cloud, do so. Through this approach, technology executives can focus on consuming the service rather than on keeping it alive and at optimal performance.
Resource Requests and Limits - set each pod's resource requests (its guaranteed minimum) while leaving the maximum limits undefined or generous. Kubernetes can then let pods that need more resources temporarily use capacity that other pods have requested but are not consuming at the moment (see the first sketch after this list).
Leverage Pod-to-Pod Networking - all pods in a Kubernetes cluster can communicate locally over the cluster network. Having microservices talk to each other through a service mesh or the built-in Kubernetes SDN, rather than traversing the internet, reduces latency.
Fine-tune Persistent Volumes - start by ensuring you choose the right storage class. Using local storage volumes for self-hosted clusters will further improve performance. Utilize dynamic provisioning for PersistentVolumeClaims (PVCs): this automates PV creation based on PVC requests, ensuring storage is provisioned only when needed and tailored to specific requirements, leading to better resource utilization and potentially improved performance compared to pre-provisioned static PVs (see the second sketch after this list).
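A minimal sketch of the requests-without-hard-CPU-limits pattern and of local, in-cluster service-to-service traffic (the names, namespace, image, and values below are hypothetical):

```yaml
# Deployment with resource requests (guaranteed minimum) and no CPU limit,
# so the pod can burst into node capacity that other pods are not using.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api                 # hypothetical name
  namespace: shop
spec:
  replicas: 2
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
    spec:
      containers:
        - name: orders-api
          image: registry.example.com/orders-api:2.3.1   # hypothetical image
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              memory: "512Mi"      # memory capped to protect the node; CPU left unbounded
---
# ClusterIP Service so other pods reach this workload over the cluster SDN
# instead of leaving the cluster network.
apiVersion: v1
kind: Service
metadata:
  name: orders-api
  namespace: shop
spec:
  selector:
    app: orders-api
  ports:
    - port: 80
      targetPort: 8080
```

Other pods reach this workload at orders-api.shop.svc.cluster.local over the cluster network, so the traffic never traverses the internet.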
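And a minimal sketch of dynamic provisioning (the storage class name, CSI provisioner, and sizes are hypothetical and provider-specific):

```yaml
# StorageClass with a CSI provisioner; volumes are created on demand when a
# PVC references this class, rather than being pre-provisioned statically.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd                   # hypothetical class name
provisioner: ebs.csi.aws.com       # example CSI driver; substitute your own
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
---
# PVC for a database; the PV is created automatically from the class above.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
  namespace: shop
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi
```

With WaitForFirstConsumer, the volume is created only when a pod using the claim is scheduled, in the zone or on the node where it will actually run.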