4 min read

Emilio Carrión

440 million in 45 minutes: 7 signs your service is about to fail

The Knight Capital story and the 7 unmistakable signs (like the Bus Factor or Friday deployments) that your service is on the verge of collapse.

postmortem, sre, culture

Before we dive in, let me tell you every SRE's favorite horror story.

440 million lost in 45 minutes

It happened on the morning of August 1, 2012. Knight Capital Group, one of Wall Street's largest trading firms, was deploying a software update.

They had 8 servers. The operations team updated the software on 7 of them... but forgot one.

The problem? The new code reused an old "flag" that was supposedly no longer in use. But on the eighth server, the one still running the old code, that "flag" activated an old test function called Power Peg, designed to aggressively buy stocks regardless of price.

At 9:30 AM the market opened. Within the first second, the system started sending erroneous orders. The engineers could see the chaos but had no idea what was happening. They didn't have an automated kill switch. They had to trace the problem manually while money evaporated.

Result: in just 45 minutes, they lost 440 million dollars. The company was pushed to the brink of bankruptcy and had to be rescued and sold shortly after.

It wasn't a hacker. It wasn't an earthquake. It was a flawed manual deployment process and accumulated technical debt.

Sound far-fetched? Maybe you won't lose millions, but the root causes of that disaster are the same ones that take your service down on a random Tuesday.

7 unmistakable signs your service is about to fail

Today we're going to talk about a topic that fascinates me, but has also given me countless headaches throughout my career: incidents.

I'm sure you know the story: service down, everyone running around like headless chickens, and you wondering why the same thing keeps happening. The reality is that disasters rarely come "out of nowhere." There are almost always early signals, silent alarms we ignore until it's too late.

This week I wanted to compile the 7 unmistakable signs that your service is about to fail. If your company shows 4 or more of these, watch out: the incident isn't a possibility, it's a certainty (and it will probably catch you on vacation).

Here they are:

1. The "Bus Factor" is 1 (Knowledge silos)

Is there a part of your infrastructure that "only so-and-so touches"? If you have a critical service that only one person knows how to operate, you have a ticking time bomb. The question isn't whether it will fail, but when (spoiler: it'll be when that person is offline).

2. The anxiety of deploying on Friday afternoon

That last-minute rush... "It's done, just push it and let's go." Deploying on a Friday at 5:30 PM without proper testing is playing Russian roulette with your weekend. If you can't guarantee quality, wait until Monday. Your future self will thank you.

3. Poor monitoring: Just logs and CPU

If your monitoring only checks whether the CPU is at 100% or reads logs, you're flying blind. You need business metrics and response times. If you don't know what's really happening, when everything crashes, all you can do is restart and pray (and that never works long-term).
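To make it concrete, here's a minimal sketch using Python's prometheus_client (the checkout example and metric names like checkout_orders_total are made up for illustration): alongside whatever CPU and log collection you already have, it exposes a business metric and a response-time histogram you can actually alert on.

```python
# Minimal sketch: expose business metrics and response times, not just CPU.
# The "checkout" service and metric names are hypothetical examples.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Business metric: how many orders we actually processed, and how many failed.
ORDERS = Counter(
    "checkout_orders_total",
    "Orders processed by the checkout service",
    ["status"],  # e.g. "ok" or "error"
)

# Response time: a histogram lets you alert on p95/p99, not just averages.
LATENCY = Histogram(
    "checkout_latency_seconds",
    "Time spent processing a checkout request",
)

@LATENCY.time()
def handle_checkout() -> None:
    time.sleep(random.uniform(0.01, 0.2))  # stand-in for real work
    if random.random() < 0.02:
        ORDERS.labels(status="error").inc()
        raise RuntimeError("payment provider timeout")
    ORDERS.labels(status="ok").inc()

if __name__ == "__main__":
    start_http_server(8000)  # metrics scraped from http://localhost:8000/metrics
    while True:
        try:
            handle_checkout()
        except RuntimeError:
            pass
```

With something like this in place, you can page on error rate and p99 latency for the operation your users actually care about, instead of waiting for the CPU graph to turn red.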


4. The endless onboarding (2 months or more)

If a new hire takes two months to become productive, your system is either too complex or terribly documented. That same complexity is what will make you take forever to fix a bug in a legacy part of the code.

5. Ignoring postmortems

Humans are the only animal that trips over the same stone twice, but engineers shouldn't allow it. If you had a serious incident and didn't document the causes or follow through on actions to prevent it from recurring, you're practically begging for it to happen again.

6. The Hero Syndrome

Is it always the same person who saves the day? That's not good. It creates a brutal dependency and prevents the team from taking ownership. On top of that, the hero ends up burned out and unable to go on vacation in peace. Encourage rotation.

7. Outdated runbooks

There's nothing worse than a "How to fix X" document that's obsolete. In the middle of chaos, following instructions that don't work doesn't just fail to help -- it wastes precious time as you try to figure out what's wrong with the manual itself.
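One cheap way to fight runbook rot is to make staleness visible. Here's a rough sketch (the runbooks/ folder and the last_reviewed: line are assumptions, not a standard) of a check you could run in CI to flag runbooks nobody has reviewed in the last 90 days:

```python
# Sketch of a CI check against stale runbooks.
# Assumes each runbook in runbooks/ contains a line like "last_reviewed: 2024-05-10".
import re
import sys
from datetime import date, timedelta
from pathlib import Path

MAX_AGE = timedelta(days=90)

def find_stale_runbooks(root: Path) -> list[str]:
    stale = []
    for runbook in sorted(root.glob("*.md")):
        text = runbook.read_text(encoding="utf-8")
        match = re.search(r"last_reviewed:\s*(\d{4}-\d{2}-\d{2})", text)
        if not match:
            stale.append(f"{runbook.name}: no last_reviewed date")
            continue
        reviewed = date.fromisoformat(match.group(1))
        if date.today() - reviewed > MAX_AGE:
            stale.append(f"{runbook.name}: last reviewed {reviewed}")
    return stale

if __name__ == "__main__":
    problems = find_stale_runbooks(Path("runbooks"))
    for line in problems:
        print(f"STALE: {line}")
    sys.exit(1 if problems else 0)
```

It won't guarantee the instructions still work, but it forces someone to look at the document before it quietly drifts out of date.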

A 440-million-dollar incident almost always starts with a technical decision nobody documented properly. To raise your team's rigor and avoid these scenarios, I rely on an RFC template to propose and document critical changes.

Does this sound familiar?

If you've seen these signs in your day-to-day work, it's time to act. Don't wait for the system to collapse. Investing in observability, documentation, and team culture is the only way to sleep soundly.


This content was first sent to my newsletter

Every week I send exclusive reflections, resources, and deep analysis on software engineering, technical leadership, and career development. Don't miss the next one.

Join over 5,000 engineers who already receive exclusive content every week

About the author

Emilio Carrión

Staff Engineer at Mercadona Tech. I help engineers think about product and build systems that scale. Obsessed with evolutionary architecture and high-performance teams.