How I Ended Up Building kubeguard

For the longest time, I was struggling to offload everything to developers, especially validation of the manifests they created from my templates.

A version bump here. A tag update there. A small tweak to a values file. So why were these tiny changes still so dramatic?

Those "small" changes were modifying live Kubernetes deployments across environments. A single line in a tag-*.yaml file could ship a new container image (which doesn't even exist - lol) to staging — or worse, production.

And most of those PRs were reviewed like regular code. But they shouldn't have been.

The Pattern I Kept Seeing

My setup wasn't unusual.

When someone updated just one file — say tag-sc-an-internal-api.yaml — the PR showed a one-line diff.

Looks safe, and usually is. But...

Except Helm doesn't render from diffs. It renders from the entire merged values context. Which means that single-line change depended on the base chart, other values files, cluster configuration, service account bindings, AWS annotations, secrets, security groups, and ALB certificates. None of that context was visible in the PR.
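That merge behavior is easy to sketch. Helm deep-merges values files in order, with later files overriding earlier keys; here's a simplified illustration (not kubeguard's actual code) of why a one-line diff only means something relative to everything merged before it:

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into `base`, later values winning.

    This mirrors how Helm combines multiple -f values files: the last
    file's one-line change is applied on top of the full merged context.
    """
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# The "one-line diff" only changes image.tag...
base = {"image": {"repository": "internal-api", "tag": "v1"}, "replicas": 2}
tag_bump = {"image": {"tag": "v2"}}

# ...but what actually ships depends on the whole merged result.
print(deep_merge(base, tag_bump))
# {'image': {'repository': 'internal-api', 'tag': 'v2'}, 'replicas': 2}
```

The PR diff shows only `tag_bump`; the rendered manifest depends on the entire merged dictionary.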

The First "Oh." Moment

I had a PR that was deploying an internal API (low traffic, high impact, very straightforward). It passed review in minutes. Later I realized:

Nothing exploded. But I spent more than 15 minutes chasing issues that surfaced one after another, each appearing only after I had fixed the previous one, all while refreshing and re-syncing ArgoCD.

That's when I realized something uncomfortable: I was treating infrastructure like static YAML, when in reality it was environment-aware code.

Why Existing Tools Weren't Enough

I tried the usual suspects: kubeval, kube-score, conftest, OPA policies. They're powerful. But they operate on manifests in isolation. They don't know which cluster you're deploying to, what permissions it has, whether that security group belongs to the correct VPC, whether the ALB certificate ARN is valid, or whether referenced secrets actually exist in AWS.

Most importantly, they don't understand Helm the way teams actually use it. My structure was multi-file, multi-environment, and split across repositories. And that's where things got tricky.

"Why Not Just Use Kyverno?"

This was a fair question. I already use Kyverno in-cluster. Kyverno is excellent at enforcing policies at admission time: block privileged containers, enforce required labels, validate resource limits, mutate objects, enforce security standards.

But here's the key difference:

Kyverno runs after you apply. kubeguard runs before you merge.

Kyverno protects the cluster. kubeguard protects the PR. That difference changes everything.

With Kyverno: a PR merges, CI/CD applies the manifest, the admission controller evaluates it, and if it violates policy, the deployment fails. That's safe — but reactive. You're debugging during deploy time.

kubeguard shifts that feedback earlier:

  1. A PR is opened.
  2. kubeguard fetches the full repo at that exact commit SHA.
  3. It resolves all values files.
  4. It renders the full Helm chart.
  5. It validates against cluster state, Ingress attributes, AWS resources, and everything else that can fail.
  6. It comments directly on the PR.

No failed deploy. No half-applied rollout. No CI noise. Just:

"Hey — this security group doesn't exist in the cluster's VPC."
Or: "This ALB certificate ARN is invalid for this environment."

Kyverno enforces guardrails at the cluster boundary. kubeguard enforces context at the review boundary. They complement each other — but they solve different problems.

The Real Problem

GitHub PRs only show what changed. But Helm rendering requires everything.

If a PR modifies only:

Deployments/staging/platform/tag-sc-an-internal-api.yaml

You still need <app>.yaml, env-<app>.yaml, the base Helm chart, and the full repository at that exact commit SHA. I realized my validator couldn't look at diffs. It had to:

  1. Fetch the full PR head commit.
  2. Checkout the repository at that SHA.
  3. Resolve all required values files.
  4. Merge them.
  5. Render Helm.
  6. Then validate against real cluster + AWS context.

Anything less would be inaccurate.
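A hypothetical helper makes the resolution step concrete. The `<app>.yaml` / `env-<app>.yaml` naming mirrors the convention above, but the function name and exact layout here are illustrative, not kubeguard's real resolver:

```python
from pathlib import Path

def resolve_values_files(changed_file: str) -> list[str]:
    """Given the one file a PR touched, return every values file Helm
    needs for the render, in merge order.

    Illustrative sketch: assumes <app>.yaml, env-<app>.yaml, and the
    tag file are siblings in the same environment directory.
    """
    path = Path(changed_file)
    env_dir = path.parent
    # "tag-sc-an-internal-api.yaml" -> app name "sc-an-internal-api"
    app = path.stem.removeprefix("tag-")
    return [
        str(env_dir / f"{app}.yaml"),      # base app values
        str(env_dir / f"env-{app}.yaml"),  # environment overrides
        str(path),                         # the tag bump itself, last
    ]

print(resolve_values_files(
    "Deployments/staging/platform/tag-sc-an-internal-api.yaml"
))
# ['Deployments/staging/platform/sc-an-internal-api.yaml',
#  'Deployments/staging/platform/env-sc-an-internal-api.yaml',
#  'Deployments/staging/platform/tag-sc-an-internal-api.yaml']
```

The one-line PR diff touches only the last entry, yet all three files (plus the chart itself) are required inputs to the render.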

So I Built kubeguard

I just wanted safer, less messy PRs that I could merge without then having to open my CD tool to check whether the deployment actually came up healthy.

The whole thing is vibe coded (in under 4 hours) — and honestly, that's what makes it exciting. The ability to just go ahead and build exactly the tool your infrastructure needs, without waiting for a vendor to add it or a team to prioritize it, is one of the most underrated things about where AI-assisted development has landed. You have a gap, you fill it; it quite literally is "that simple".

kubeguard became a GitHub App written in Python (FastAPI) that acts as a PR webhook listener, Helm rendering engine, environment-aware validator, AWS context checker, and developer-friendly PR commenter — all in one stateless service.

It does things like:

  - checking whether a referenced security group actually exists in the cluster's VPC
  - validating that an ALB certificate ARN is valid for the target environment
  - confirming that referenced secrets actually exist in AWS
  - verifying the render against the cluster you're actually deploying to

And it reports findings in a way that developers can understand. No node names. No infra noise. Just: is this change safe or risky?

How the Rule Engine Works

kubeguard runs 20+ rules across six categories: Resource Safety, Availability, Scheduling, Security, Networking, and Helm-specific checks. Each finding is scored and the total is capped at 100.

The scoring is environment-aware. In production, findings keep their original severity and the PR fails if any CRITICAL finding exists or the score exceeds 70. In nonprod, each finding is downgraded one level (CRITICAL→HIGH, HIGH→MEDIUM, etc.) and the threshold is relaxed to 85.
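A minimal sketch of that scoring logic (the point values below are made up; the real rules and weights live inside kubeguard):

```python
SEVERITY_ORDER = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]
THRESHOLDS = {"prod": 70, "nonprod": 85}  # a score above this fails the PR

def downgrade(severity: str) -> str:
    """Nonprod downgrades each finding one level (CRITICAL -> HIGH, etc.)."""
    idx = SEVERITY_ORDER.index(severity)
    return SEVERITY_ORDER[max(idx - 1, 0)]

def evaluate(findings: list[tuple[str, int]], env: str) -> tuple[int, bool]:
    """Return (capped score, passed?) for a list of (severity, points)."""
    if env != "prod":
        findings = [(downgrade(sev), pts) for sev, pts in findings]
    score = min(sum(pts for _, pts in findings), 100)  # capped at 100
    has_critical = any(sev == "CRITICAL" for sev, _ in findings)
    passed = not has_critical and score <= THRESHOLDS.get(env, THRESHOLDS["nonprod"])
    return score, passed

# The same finding blocks a prod PR but only warns in nonprod.
print(evaluate([("CRITICAL", 40)], "prod"))     # (40, False)
print(evaluate([("CRITICAL", 40)], "nonprod"))  # (40, True)
```

The asymmetry is deliberate: the same manifest gets judged differently depending on where it's headed.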

A typical CLI run against a local chart looks like this:

kubeguard-analyze analyze -c ./chart -v values.yaml -e nonprod

And the output:

## Kubernetes Helm Risk Report (local)

Chart:       /path/to/charts/my-app
Environment: nonprod
Score:       20/100 (LOW)

Passed checks
  ✅ CPU request set
  ✅ Memory request set
  ✅ Multiple replicas
  ✅ Readiness probe set
  ✅ Not privileged
  ✅ Ingress has TLS
  ... (15 more)

🟡 MEDIUM:
  - [Deployment/my-app] Container has no resource limits
  - [Deployment/my-app] No PodDisruptionBudget found
  - [Deployment/my-app] No securityContext set

🟢 LOW:
  - [Deployment/my-app] No podAntiAffinity (pods may schedule on same node)

Recommendation: Review findings and improve where needed.

For teams with split repos — chart templates in one place, environment values in another — the central_chart resolver handles the three-repo setup, cloning the chart from its source and pulling values from the PR repo before rendering.
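Under the hood, that resolver boils down to a couple of git operations before the render (simplified to two repos here; the real resolver also handles auth, caching, and cleanup, and the repo URLs below are placeholders):

```python
def workspace_commands(chart_repo: str, values_repo: str,
                       head_sha: str) -> list[list[str]]:
    """Git commands to assemble a render workspace for the split-repo
    layout: chart templates from one repo, environment values from the
    PR repo at its head SHA. Illustrative sketch only.
    """
    return [
        # shallow-clone the shared chart source
        ["git", "clone", "--depth", "1", chart_repo, "workspace/chart"],
        # full clone of the PR repo so any head SHA can be checked out
        ["git", "clone", values_repo, "workspace/values"],
        ["git", "-C", "workspace/values", "checkout", head_sha],
    ]
```

Each command would be run with `subprocess.run(cmd, check=True)` before invoking `helm template` against the assembled workspace.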

Score-Gated Auto-Merge

Once the score was reliable, the next question was obvious: if kubeguard already knows whether a PR is safe, why are humans still clicking merge?

So I wired the score into the merge pipeline. kubeguard can run as a CLI command in a CI job, but I use it as a webhook — a GitHub App running as a long-lived server. When a PR is opened, GitHub sends a pull_request webhook to kubeguard — it clones, renders, validates, and posts the Check Run result entirely outside of the PR's Actions pipeline. No runner minutes. No job to wait for in the CI queue.

The flow is straightforward:

  1. A PR is opened against the deployment branch.
  2. GitHub fires the pull_request webhook to the kubeguard server — this happens in parallel to any CI jobs, not as part of them.
  3. kubeguard clones the repo at the PR's head SHA, resolves all values files, renders the Helm chart, scores the findings, and posts a Check Run directly via the GitHub API.
  4. If the score is below the threshold and there are no CRITICAL findings, the Check Run passes.
  5. A separate GitHub Actions workflow — triggered by the check_run event, not by the PR itself — watches for the kubeguard Check Run to complete.
  6. On pass, it calls the GitHub API to auto-merge the PR into the target branch.
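The Check Run in step 3 is just a REST call. Here's a hedged sketch of the payload kubeguard might send; the field names come from the GitHub Checks API (`POST /repos/{owner}/{repo}/check-runs`), but the summary wording is invented:

```python
def build_check_run(head_sha: str, score: int, passed: bool) -> dict:
    """Payload for creating a Check Run via the GitHub Checks API.

    Field names follow the API schema; the title/summary text is
    illustrative, not kubeguard's exact output.
    """
    return {
        "name": "kubeguard",  # what the auto-merge workflow matches on
        "head_sha": head_sha,
        "status": "completed",
        "conclusion": "success" if passed else "failure",
        "output": {
            "title": f"Risk score: {score}/100",
            "summary": "All checks passed."
                       if passed else "Findings exceed the risk threshold.",
        },
    }
```

The `name` field is what the downstream workflow keys on, which is why the Actions trigger below filters on `check_run.name == 'kubeguard'`.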

The GitHub Actions side is minimal:

on:
  check_run:
    types: [completed]

jobs:
  auto-merge:
    if: >
      github.event.check_run.name == 'kubeguard' &&
      github.event.check_run.conclusion == 'success'
    runs-on: ubuntu-latest
    permissions:
      contents: write       # required to merge
      pull-requests: write  # required to enable auto-merge
    steps:
      - name: Auto-merge PR
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          # Look up the PR for this commit; `gh pr list --head` filters by
          # branch name, not SHA, so use the commits/{sha}/pulls endpoint.
          PR=$(gh api \
            "repos/${{ github.repository }}/commits/${{ github.event.check_run.head_sha }}/pulls" \
            --jq '.[0].number')
          gh pr merge "$PR" --squash --auto

For nonprod branches this runs fully automatically. For prod, I kept a required human approval as an additional gate — the auto-merge only fires once that approval is also present. kubeguard's Check Run is one of two required status checks; both need to pass before the merge goes through.

The result: low-risk tag bumps and config patches on nonprod merge themselves within seconds of the PR opening. No one has to watch a queue. The infra team only gets pulled in when something actually needs review.

The Design Principle I Kept Coming Back To

Context > Syntax

YAML validity isn't enough. A manifest can be perfectly valid and still fail at runtime, bind to the wrong security group, reference a missing secret, or target the wrong cluster.

kubeguard is built around contextual validation. It doesn't just ask: "Is this YAML correct?" It asks: "Is this safe for this environment?"
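One contextual rule, sketched minimally: checking that a referenced security group lives in the cluster's VPC. The data shape matches what EC2's `DescribeSecurityGroups` returns via boto3, but the function name and wiring here are illustrative, and the check itself is kept pure so it's easy to test:

```python
def sgs_outside_cluster_vpc(sg_descriptions: list[dict],
                            cluster_vpc_id: str) -> list[str]:
    """Return IDs of security groups that are NOT in the cluster's VPC.

    `sg_descriptions` is the "SecurityGroups" list from the EC2
    DescribeSecurityGroups response (each entry has "GroupId" and
    "VpcId"). In a real run the data would come from boto3, e.g.:
        ec2.describe_security_groups(GroupIds=ids)["SecurityGroups"]
    """
    return [sg["GroupId"] for sg in sg_descriptions
            if sg["VpcId"] != cluster_vpc_id]
```

A syntactically perfect manifest sails through schema validation with the wrong group ID; only a check like this, aware of which cluster (and which VPC) the PR targets, catches it before merge.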

The (Un)Expected Side Effect

Once I had this running on every PR, infrastructure deployments and reviews became far less dependent on the infra team, which meant less boring work for everyone.

I stopped jumping into ArgoCD to check whether something was broken, or into Kyverno to see what was missing. That alone is a real improvement: it saves time and is far friendlier for developers.

I'm also planning to add a small dashboard — just SQLite, nothing fancy — to track PRs scanned, pass/fail rate, common violations, and open PR count. Not because I need to track things (I do, of course xD), but why the hell not?

Final Thought

Infrastructure isn't YAML.

It's state. It's identity. It's permissions. It's environment. It's side effects.

PR diffs don't show that.

Kyverno protects the cluster. kubeguard protects the merge.

And I built it because I learned, slowly, that catching things earlier is calmer than fixing them later.

The project is open: github.com/gagan1510/kubeguard.