Kubernetes Storage Follies: A Chaos Engineer's Confession
Authored by Daniel Bunte, Last updated: 2025-09-28

Let me start with a confession: I’m a Kubernetes noob.
Now, before you roll your eyes and close this tab, hear me out. In a recent engagement, I worked as a chaos engineer on one of Norway’s most ambitious energy grid projects. My job was to break things: kill pods, introduce network failures, create poison pills - all in the name of improving system resilience. I was the guy who kept your lights on by making sure our systems could survive almost anything.
But here’s the thing about chaos engineering: you’re usually working with systems that someone else built and configured. You’re the storm, not the architect. When it came to actually setting up Kubernetes from scratch, configuring storage, and dealing with the nitty-gritty details? Well, let’s just say my comfort zone was Ansible, Docker, and bare metal servers that did exactly what I told them to do.
That’s why I chose MicroK8s. It promised simplicity. It promised to abstract away the complexity. It promised that someone like me, comfortable with containers but relatively new to the Kubernetes administration game, could get up and running quickly. Famous last words.
The “Good Enough” Philosophy Strikes Again
Before I dive into this week’s storage saga, I need to address the elephant in the room: my relationship with the “good enough for now” philosophy.
As a solo founder building Vioro in public, I’ve repeatedly chosen “ship fast, fix later” over doing it perfectly from day one. “I’ll just use this quick solution for now,” I tell myself. The problem? “Later” has this annoying habit of arriving sooner than expected, usually accompanied by production issues and a healthy dose of regret. This week was one of those times.
Enter OpenEBS Jiva (Exit Sanity)
My “good enough” storage solution was OpenEBS Jiva. On paper, it looked perfect: dynamic storage provisioning, replication, all the buzzwords I thought I wanted.
The problem with my setup is that I’m running MicroK8s across multiple virtual servers that are geographically separated. Why? Because small VPSes have the best cost/value ratio, and I could take advantage of special offers with double the disk space. I ended up with two x86_64 nodes (one in South Germany, one in Austria) and an additional arm64 node, initially unaware of how much even these seemingly short distances—or the mixed architecture—would strain a solution like OpenEBS Jiva.
I chose this setup because I thought it was a clever way to get redundancy on a budget. The issue was a basic lack of knowledge about zones and regions: the few milliseconds of latency between these nodes meant that my storage replicas were about as synchronized as a flash mob organized via carrier pigeon. Disk synchronization just… didn’t.
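For context, the sort of StorageClass I was leaning on looks roughly like this - a sketch in the older OpenEBS “cas-type” annotation style (newer releases use a Jiva CSI provisioner and a volume policy instead), so treat the exact names as version-dependent rather than as my actual manifest:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-jiva-replicated
  annotations:
    openebs.io/cas-type: jiva
    cas.openebs.io/config: |
      # Every write is acknowledged by all replicas before it completes -
      # fine inside one data center, brutal across a national border.
      - name: ReplicaCount
        value: "3"
provisioner: openebs.io/provisioner-iscsi
```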
After watching my logs fill with sync errors and my applications randomly lose data, I realized my “good enough” solution had become a “definitely not good enough” problem.
The Nuclear Option: Time to Embrace the Wasteland
There was no coming back, no quick fix. I had to go full Vault-Tec on the cluster. (You know, Vault-Tec - those folks who built the bunkers in the Fallout universe - so people could survive in case of a nuclear disaster.)
When your storage layer is unreliable, there’s really only one solution: tear everything down and start over. I had to nuke every single component that relied on a Persistent Volume Claim (PVC) - the actual cleanup is sketched just after the casualty list:
- Rauthy (our authentication service): Gone
- Arc-runners (our CI/CD pipeline): Obliterated
- PostgreSQL with CloudNativePG: Vaporized
- Vioro backend: Reduced to atoms
My beautiful cluster was suddenly a ghost town. Nothing but system pods, some nginx workers for the website, and crushing self-doubt.
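In case you’re wondering what “nuking” looks like in practice, it’s depressingly mundane - a handful of deletes, sketched here with illustrative names rather than my actual manifests:

```bash
# Remove everything that owns a PVC (namespaces and names are made up for illustration)
kubectl delete namespace rauthy
kubectl delete namespace arc-runners
kubectl delete clusters.postgresql.cnpg.io vioro-pg -n databases   # the CloudNativePG Cluster CR
kubectl delete deployment vioro-backend -n vioro

# Then drop the orphaned claims and the broken StorageClass, and confirm nothing is left Bound
kubectl delete pvc --all -n databases
kubectl delete storageclass openebs-jiva-replicated
kubectl get pv
```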
The Great Storage Bake-Off
What followed was the storage equivalent of a reality TV cooking show.
Contestant #1: Longhorn
I had high hopes for Longhorn. Rancher’s solution seemed mature and designed for my use case. It failed spectacularly. Pods wouldn’t start, volumes wouldn’t mount, and the logs were about as helpful as a chocolate teapot.
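The triage loop for a stuck CSI volume is always roughly the same; it’s also how I concluded Longhorn and I were not going to be friends (longhorn-system is the default install namespace, the rest of the names are placeholders):

```bash
# Is the storage layer itself healthy?
kubectl get pods -n longhorn-system
kubectl get events -n longhorn-system --sort-by=.metadata.creationTimestamp

# Why is this particular claim/pod stuck?
kubectl describe pvc <stuck-claim> -n <app-namespace>
kubectl describe pod <stuck-pod> -n <app-namespace>
```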
Contestant #2: Rook-Ceph
Finally, in desperation, I turned to Rook-Ceph. Getting Ceph to work required a ritual that would make ancient shamans jealous.
For each and every server in my cluster, I had to (a command-level sketch follows this list):
- Gracefully drain the node from Kubernetes (the polite way to kick it out)
- Shut down the server completely
- Boot into rescue mode (because apparently, in 2025, we’re still doing this)
- Manually resize the root partition to 100GiB for OS and … stuff …
- Create a dedicated Ceph partition using all the remaining space
- Reboot and pray to whatever deities govern distributed storage systems
- Rejoin the cluster and hold my breath
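A rough command-level sketch of that ritual, with device names and sizes as examples (it assumes an ext4 root on /dev/sda1 - check lsblk first and have backups, because shrinking a root partition is destructive if you get the order wrong):

```bash
# 1. Politely evict everything from the node
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# 2. In rescue mode: shrink the filesystem BEFORE the partition,
#    then hand the rest of the disk to Ceph as a raw partition
e2fsck -f /dev/sda1
resize2fs /dev/sda1 95G              # shrink the FS a little below the target first
parted /dev/sda resizepart 1 100GiB
parted /dev/sda mkpart ceph 100GiB 100%
resize2fs /dev/sda1                  # grow the FS back to fill the now-100GiB partition

# 3. Reboot normally, then let the node take workloads again
kubectl uncordon <node-name>
```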
Oh, and somewhere in this process, I learned that MicroCeph and Ceph are completely different things. I had assumed MicroK8s would come with MicroCeph, which apparently can consume free disk space automatically without dedicated partitions. So I tried that, wondered why it didn’t work, learned the difference, and had to redo several steps once I realized this was about Ceph, not MicroCeph. That particular misunderstanding cost me several hours of confusion and at least three cups of coffee.
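For the record, the part of the Rook config that ties those hand-carved partitions into the cluster looks roughly like this - abridged, with node names, device names, and the Ceph image tag as placeholders rather than my exact manifest:

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v18   # placeholder tag
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
  storage:
    useAllNodes: false
    useAllDevices: false
    nodes:
      - name: node-de-south        # placeholder node names
        devices:
          - name: "sda2"           # the partition carved out in rescue mode
      - name: node-at
        devices:
          - name: "sda2"
```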
The Miracle
Here’s the shocking part: it actually worked.
After updating all my Ansible scripts and Kubernetes configurations, everything hummed along beautifully. Postgres started up, authentication worked, CI/CD pipelines resumed, and my Rust-based applications could actually persist data without randomly losing it. Even the rauthy-backup worked beautifully.
Fortunately, I found these issues early, not half a year from now with hundreds of customers on the system. I also got to test recovery - check.
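My “did it actually work” check is now a couple of commands against Rook’s toolbox (assuming the rook-ceph-tools deployment is installed), plus the boring one that matters most:

```bash
# Ceph's own view of cluster health and where the OSDs live
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree

# And the boring one: is every claim actually Bound?
kubectl get pvc -A
```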
💡 What Does Best Practice Look Like? (The SRE Answer)
If you’re reading this and wondering what the “right” way to do this is (because my Rook-Ceph across South Germany and Austria solution is certainly not it), here is the official, SRE-approved best practice for reliable Kubernetes stateful storage:
- Managed Services First: The gold standard is almost always to use a cloud provider’s managed service (like RDS or Cloud SQL). They handle the persistence, replication, and disaster recovery so you don’t have to.
  - My excuse: I didn’t want to spend that kind of money without a big enough customer base.
- Keep Your Storage Local: If you must roll your own storage (like Ceph or Longhorn), all nodes in the storage cluster must reside in the same data center or availability zone (AZ). Why? To ensure sub-millisecond, low-latency communication for synchronous data replication. The few milliseconds of latency between my Germany and Austria nodes were the silent killer of OpenEBS Jiva.
  - My excuse: Never having set up availability zones before, I thought to myself: why run two clusters for distribution when I can just spread the nodes geographically within one? That was a false assumption, as we’ve seen.
- Dedicated Storage Nodes: For critical applications, separate the worker nodes that run your application pods from the nodes that run your storage services (Ceph monitors/OSDs) to prevent resource contention. (Both this and the previous point map onto plain node labels and taints - see the sketch after this list.)
  - My excuse: I’m sure I’ll do this once I have that kind of load. For now, the Rust applications are very resource-friendly.
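Both “keep it local” and “dedicated storage nodes” boil down to plain Kubernetes primitives - zone labels, role labels, and taints. A minimal sketch, with node and zone names made up:

```bash
# Tell Kubernetes (and storage operators) which zone each node lives in
kubectl label node node-a topology.kubernetes.io/zone=de-south-1
kubectl label node node-b topology.kubernetes.io/zone=de-south-1

# Dedicate a node to storage: label it, then taint it so app pods stay away
kubectl label node node-c role=storage
kubectl taint node node-c storage=dedicated:NoSchedule
```

A storage operator like Rook can then be pinned to those nodes via the placement section of its CephCluster spec, while application pods never schedule onto the tainted storage nodes unless they explicitly tolerate the taint.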
I’m doing none of this. But as a solo founder, the best practice is often to find the solution that allows you to keep shipping product without breaking the bank.
Lessons in Humility
This whole experience has been a humbling reminder of the difference between breaking things and building things. As a chaos engineer, I was really good at finding weaknesses in systems. But actually architecting resilient systems from scratch? That’s a completely different skill set.
My distributed setup is weird and probably wrong, but it’s what I can afford right now, and it gets the job done. The “good enough for now” philosophy has bitten me enough times that I’m starting to learn. But for now, it’s stable, it’s working, and I can focus on building the actual product instead of fighting with storage systems … until “good enough” isn’t, well, good enough, that is.
What’s Next?
The plan is to keep building on this slightly-cursed-but-functional foundation while documenting every mistake along the way. If you’re a solo founder dealing with similar infrastructure challenges, know that you’re not alone in making questionable architectural decisions in service of getting something—anything—working.
Next week, I’ll probably be writing about some other “temporary” solution that came back to haunt me. Such is the life of building in public.
Stay tuned for more infrastructure confessions from the Vioro trenches. And remember: if you’re not occasionally questioning your life choices at 2 AM while reading Kubernetes documentation, you’re probably not doing startup engineering right.
Building Vioro in public means sharing the messy, imperfect reality of startup engineering. Follow along for more honest takes on the intersection of chaos engineering, questionable architecture decisions, and the eternal struggle between “do it right” and “ship it fast”.
