How to Operate an API Gateway Effectively

A Journey towards Operational Excellence through Ownership and Data

Jun 17, 2024

This is the story of how my team came to own an API gateway. I will share how we've achieved operational excellence while serving hundreds of millions of requests each month. For new readers of this blog, it's worth mentioning that I'm part of an IAM team.

When you think about the meaning of "IAM" (Identity and Access Management), it all comes down to these two pillars:

authentication, aka “Who are you?”
authorization, aka “What are you allowed to do?”

At the core of the IAM domain are entities like users, roles, and permissions. It might be surprising to hear of an IAM team owning an API gateway too. Does that make any sense?

Some time ago, our core product teams couldn't deliver new features fast enough anymore. This was because of the monolithic architecture. That classical “break the monolith” decision was due.

Yet, teams can only extract capabilities from the monolith if the extracted services continue to support the existing authentication mechanisms. So, the IAM architecture needed to change as well. Building custom authentication for each team creates incompatible solutions and duplicates work. You need a centralized IAM solution for both the core product and the new capabilities.

The first big architecture change was adding an API gateway. Now, instead of reaching the main product directly, customers accessed the product through a central entry point. This gateway is key for security. Furthermore, it enables each team to rebuild their backend without disrupting customers by applying the strangler pattern.
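To illustrate how a gateway supports the strangler pattern, here is a minimal sketch of prefix-based routing. It is not Kong's actual route-matching logic, and the upstream URLs, path prefixes, and the `resolve_upstream` helper are all hypothetical. The idea is that an extracted capability gets a more specific route, while everything else keeps flowing to the monolith.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    path_prefix: str
    upstream: str


# Hypothetical upstreams, for illustration only.
MONOLITH = "http://monolith.internal"
REPORTS_SERVICE = "http://reports-service.internal"

ROUTES = [
    Route("/api/reports", REPORTS_SERVICE),  # capability already extracted
    Route("/api", MONOLITH),                 # everything else stays on the monolith
]


def resolve_upstream(path: str, routes=ROUTES, default: str = MONOLITH) -> str:
    """Return the upstream of the longest route prefix that matches the path."""
    matches = [
        route
        for route in routes
        if path == route.path_prefix
        or path.startswith(route.path_prefix.rstrip("/") + "/")
    ]
    if not matches:
        return default
    return max(matches, key=lambda route: len(route.path_prefix)).upstream
```

Customers keep calling the same URLs. Moving another capability out of the monolith then only means adding one more specific route, which is why the backend can be rebuilt piece by piece without disrupting them.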

Introducing an API Gateway to an existing Architecture

The IAM team picked Kong for the job, as they needed a gateway compatible with all major cloud providers. After the rollout to production, they handed the gateway over to a central infrastructure platform team. Unfortunately, this platform team was already overwhelmed with ongoing tasks. As a result, the gateway's maintenance was neglected. Over time, the rationale for choosing Kong became blurry, and there was no plan to evolve the solution further.

It was around this time that I got promoted and joined the IAM team as the Staff Engineer. Customer traffic continued to increase. 502/503 HTTP errors and high API response latencies appeared. Meanwhile, my team got dragged into incidents because we were the go-to experts for troubleshooting API gateway problems.

Moreover, to modernize the IAM architecture as planned, we would need to integrate deeply with the API gateway. Any change to Kong would need the active support and approval of the owning team. Yet the changes that we had already requested from the platform team were not picked up due to competing priorities and capacity constraints. It was time to rethink our approach.

At this point, I had thought about owning the API gateway for quite some time already. It wasn't something that I wanted to do, but I felt it was something that our team needed to do.

Turning the Ship Around

The situation in the platform team wouldn't change anytime soon. I was convinced that taking over end-to-end responsibility for this critical component would enable us to:

gain back control to prioritize operational excellence
reduce API gateway incidents
refine and clarify the long-term strategy of the API gateway
remove dependencies and move faster on rearchitecting the IAM stack

When I shared the idea of owning Kong with my team, the engineers were not enthusiastic. We already had a big IAM rearchitecture project to lift. Why would we add more complexity to our plate? There were also discussions around focusing on our core domain. After all, the API gateway is an infrastructure component to be managed by platform teams, right?

It was clear that owning such a critical component comes with operational overhead. Since the gateway processes every single customer request, we feared that we would be pulled into more API incidents.

Additionally, it would need significant investment to remove accumulated technical debt. I compiled a backlog of topics that we'd need to address. Together with my manager, we used this backlog as leverage. We declared that we'd be happy to take over full ownership of the API gateway if we got an extra engineer for our team. Our plan succeeded, and we moved ahead with hiring. This was a significant achievement during a time of cost reduction when most engineering teams were shrinking.

In the IAM team, we sat together and prioritized the Kong backlog. We agreed to focus on high-impact improvements that do not take months to deliver. This way, we could earn trust as a team and prove that we are capable of owning such a critical component. Within the first six months, we achieved magnificent results:

enhanced observability of Kong by creating a new dashboard following the RED method. The new dashboard pinpoints the source of latency and errors by clearly distinguishing between Kong and upstream services
improved alerts for Kong to proactively notify us in case of any issues
created a best-practice runbook for troubleshooting 502/503 HTTP errors and latencies, applicable to all teams. As other teams started to use the runbook and dashboard, our involvement in incidents decreased
upgraded Kong to the latest major version, reducing the time it takes to proxy customer requests
refactored automation to update Kong routes, reducing the pressure on the Kong Admin API. Outcome: improved reliability of Kong configuration management at runtime (e.g., when routes are provisioned for a new customer)
developed a performance test framework tailored for Kong to simulate high traffic and improve horizontal scalability
reduced infrastructure costs of Kong by over $100K annually. Added proactive cost tracking going forward. Check out my blog post about these cost reductions: Six Steps to Reduce Costs of your Kubernetes Workload
created an architecture decision record that explains why we chose Kong as the API Gateway and how it fits into the target architecture. Actively shared this knowledge and addressed concerns within the company

Improved Observability: Kong Proxy vs Upstream Latency
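To make the idea behind such a dashboard concrete, here is a minimal sketch of the RED method (Rate, Errors, Duration) computed from per-request log entries, with latency split between the gateway and the upstream service. It is not our actual dashboard implementation. The `RequestLog` fields and helper names are hypothetical; Kong's logging plugins report a comparable latency breakdown, but check your own log format for the exact field names.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestLog:
    # Hypothetical fields, for illustration only.
    status: int          # HTTP status returned to the client
    gateway_ms: float    # time spent inside the gateway itself
    upstream_ms: float   # time spent waiting for the upstream service


def percentile(values, pct):
    """Nearest-rank percentile; returns 0.0 for an empty input."""
    if not values:
        return 0.0
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def red_summary(logs, window_seconds):
    """Compute Rate, Errors, and Duration for one time window."""
    total = len(logs)
    errors = sum(1 for entry in logs if entry.status >= 500)
    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": errors / total if total else 0.0,
        "gateway_p95_ms": percentile([e.gateway_ms for e in logs], 95),
        "upstream_p95_ms": percentile([e.upstream_ms for e in logs], 95),
    }


def slower_side(summary):
    """Name the side that dominates p95 latency in the summary."""
    if summary["upstream_p95_ms"] > summary["gateway_p95_ms"]:
        return "upstream"
    return "gateway"
```

Keeping the two duration series separate is the important design choice. During an incident, it shows at a glance whether to look at the gateway or at the service behind it.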

Each of these achievements stands on its own. The biggest achievement, though, is that we haven't had a single incident in our team since we started owning Kong. What a win for our customers and our team! 🎉

I’m proud and grateful for how my team demonstrated the “disagree and commit” value. Our fear of being overwhelmed by the operational burden did not turn into reality. Instead, we are confidently serving hundreds of millions of requests per month with Kong.

TL;DR

Using a platform service can enable your team to focus on their core domain (I wrote about this previously here). However, one size does not fit all, right? If using a shared platform slows down delivery, you should reconsider the approach. Analyse whether the priorities of both teams are still aligned and whether the platform still suits your needs. Sometimes priorities have diverged too much. In our case, taking full control over the API gateway eliminated handovers and led to operational excellence. My team can now redesign and deliver a new IAM stack with minimal dependencies. 🚀

Over to you: Did you ever struggle to collaborate with a platform team? What is a “must have” of every platform service? 🤔

Engineering Decompiled
