Midnight Mayhem: Lessons from a 2 AM Outage
From panic to perspective: Lessons from a midnight production fire.
I was just getting started as a software engineer when it all went down.
It was 2 AM when my phone rang. It was Raj. He sounded anxious—really anxious.
“Bro, store is down!”
"Store" was one of our core APIs. If it was down, it wasn't just a bad thing—it was disastrous. I jumped on a call with the rest of the team. Ram, one of our engineers, had deployed a new version a few hours earlier. Naturally, all eyes turned to him. But he wasn’t convinced it was the cause. It was just a minor change, after all, and had been thoroughly tested in our dev environment.
Back then, we were still running on VMs—no autoscaling, no Kubernetes. Scaling meant manually provisioning new virtual machines. (We couldn’t use VMSS, Azure’s Virtual Machine Scale Sets, due to a design decision that, thankfully, we fixed later during our Kubernetes migration.)
Ram debugged for about 30 minutes. At 2:30 AM, we tried a rollback—but the issue persisted. I noticed our latency alarms had gone off, and total request volume had spiked. The culprit? One of our biggest merchants was running a sale. The traffic surge was real.
Ram was understandably annoyed—we’d spent all that time suspecting the code, when the real issue was traffic.
At this point, I made the call: we needed to scale horizontally. Ram, Raj, and Dhanush looked on, probably wondering if I had a plan—and I did. I spun up two new VMs and added their IPs to our inventory.
But adding nodes wasn’t just about spinning up a few instances. I had to run our Terraform scripts to provision the new VMs, update the Ansible inventory and playbooks, and even tweak our deployment pipeline to accommodate the changes. All of this was time-consuming—time we didn’t really have. We finally deployed our services onto the new nodes, and despite the grind, everything went relatively smoothly.
That is, until I tried deploying an auxiliary service.
This is where things got messy. I ran into dependency error after dependency error; every fix surfaced another problem. It turned out the service’s config was version-locked against an older VM image, and the new nodes were running something newer, so nothing lined up. There was a helper script to handle exactly this, but I had no idea where it was.
It was 3:30 AM. I’d already called Rishin a few times for help with node setup. This time, he joined the call.
I explained the whole situation. He asked, casually:
“Did you try vertical scaling?”
I froze.
I hadn’t even considered it.
Tunnel vision had led me down the horizontal scaling rabbit hole. But the fix was simple—obvious, even. I immediately upgraded the VM tier. Traffic was handled, latency dropped, and the alarms stopped. Just like that, the problem was resolved.
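For anyone wondering what “upgraded the VM tier” actually involves: below is a minimal sketch of vertical scaling on Azure using the Python SDK (azure-identity and azure-mgmt-compute). The resource group, VM name, and target size here are hypothetical placeholders, not our real setup; the same change can just as well be made from the portal or the Azure CLI.

```python
# Minimal sketch of vertical scaling on Azure: bump a VM to a larger size (SKU).
# Resource group, VM name, and target size are placeholders, not our real setup.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

subscription_id = "<subscription-id>"
resource_group = "store-prod-rg"   # hypothetical
vm_name = "store-api-01"           # hypothetical
new_size = "Standard_D8s_v3"       # one tier up, for illustration

compute = ComputeManagementClient(DefaultAzureCredential(), subscription_id)

# Fetch the current VM definition, change only its size, and push the update.
vm = compute.virtual_machines.get(resource_group, vm_name)
vm.hardware_profile.vm_size = new_size

# This is a long-running operation; the VM typically restarts as part of the resize.
compute.virtual_machines.begin_create_or_update(resource_group, vm_name, vm).result()
```

The catch, of course, is that a resize usually means a brief restart and a single machine can only get so big, so vertical scaling buys you time rather than replacing proper horizontal scaling or autoscaling.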
Takeaways:
Don’t jump to conclusions. Not every incident is caused by the most recent code change.
Tunnel vision is real. When you're in crisis mode, it’s easy to get stuck on one solution path.
Know your tools. Scaling strategies (horizontal vs. vertical) should be in your mental toolbox, always.
Documentation matters. That helper script could’ve saved me an hour—if I’d known where it was.
Communication saves time. Bringing in the right person at the right time made all the difference.
It was a long night—but it taught me lessons I’ll never forget.