What this is.
Experience + Opinion + Fact (50% experience · 30% opinion · 15% fact · 5% fiction)
Written in collaboration with AI — I discuss, I do not outsource.

I was hired once to sit between a Fortune 500 hardware company and the external vendor writing all their firmware. The mandate was simple: review every pull request, catch every defect before production. After three months I had caught hundreds of defects. After six months, the same class of bug was still showing up in a different module every sprint. The gate was working perfectly. Quality was not getting any better.

Chapter 1. Hired to Hold the Quality Gate

A Fortune 500 hardware company. An external vendor doing all the firmware and embedded software. A clear mandate from the customer side: my job was to sit between the two and protect the production line.

Review every pull request. Read every diff. Block what should not ship. The vendor wrote the code; I owned the gate 🚧.

For a senior engineer who had spent 20+ years building products, the role looked obvious. More eyes, fewer escapes. Same logic the industry has trusted for decades.

First principle. When you are hired to enforce quality, the first move always looks obvious — review more, catch more. That is not wrong. It is just incomplete.

So I did exactly what the role asked for. The next chapter is what that actually looked like, week after week.

Chapter 2. The Obvious Solution — Review Everything

The first few months were a grind. Pull requests, inline comments, back-and-forth threads, sometimes the same one twice 😅.

A missing null check that could corrupt a state machine on edge input. Caught it. Written up. Fixed. An implicit assumption in a sensor driver — the kind that works fine in the lab and surfaces at 2 a.m. on the factory floor 🏭. Caught it. Written up. Fixed. A misuse of a shared buffer that would have caused intermittent crashes once the device was thermally stressed. Caught it. Written up. Fixed.

The numbers looked good. Defects per sprint went down. The vendor team was responsive. The customer side was happy with my reports. By every dashboard the program tracked, the gate was working.

First principle. More thorough review catches more defects. That is true. It is just not sufficient.

If catching defects were the only thing that mattered, this would be the end of the story. It was not.

Chapter 3. Same Class. Different Module. Every Sprint.

It took me a couple of quarters to see what was actually happening. The bugs I was catching were never literally the same bug 🔍. It was always the same class of bug, appearing in a new module, written by a different engineer, in a different subsystem, every sprint.

A null check missing here. An ownership boundary left implicit there. A state machine embedded in code that was never designed to hold state. A shared buffer in a third place. Different files, different authors, identical shape.

The dashboard kept saying we were getting better. The pattern said something different.

This is the part most quality programs miss. They count instances. They do not count classes. The instances were going down because we were catching them. The classes — the structural reasons those bugs were possible at all — never moved.
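A minimal sketch of the difference between counting instances and counting classes, written in Python for brevity. The review log, the class labels, and the module names are all invented for illustration; they are not data from the program described here.

```python
from collections import defaultdict

# Hypothetical review log: (sprint, module, defect_class).
# Every entry is invented for illustration.
REVIEW_LOG = [
    (1, "sensor_driver", "missing_null_check"),
    (1, "sensor_driver", "shared_buffer_misuse"),
    (1, "motor_ctrl", "missing_null_check"),
    (2, "comms", "missing_null_check"),
    (2, "power_mgmt", "shared_buffer_misuse"),
    (3, "bootloader", "missing_null_check"),
]


def instances_per_sprint(log):
    """Count findings per sprint: the number a typical dashboard tracks."""
    counts = defaultdict(int)
    for sprint, _module, _defect_class in log:
        counts[sprint] += 1
    return dict(counts)


def recurring_classes(log, min_sprints=2):
    """Return defect classes seen in at least `min_sprints` distinct sprints,
    mapped to the set of modules each class appeared in."""
    sprints_seen = defaultdict(set)
    modules_seen = defaultdict(set)
    for sprint, module, defect_class in log:
        sprints_seen[defect_class].add(sprint)
        modules_seen[defect_class].add(module)
    return {
        cls: modules_seen[cls]
        for cls, sprints in sprints_seen.items()
        if len(sprints) >= min_sprints
    }


if __name__ == "__main__":
    print(instances_per_sprint(REVIEW_LOG))
    print(recurring_classes(REVIEW_LOG))
```

In this made-up log the per-sprint count falls from three to two to one, while the missing null check still turns up in a new module every sprint. The first function is the dashboard; the second is the pattern.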

First principle. A recurring class of defect is a structural signal. It is the architecture telling you something — not the engineer.

Once I saw the pattern, the question changed. The next chapter is what changed with it.

Chapter 4. Review Is a Lagging Indicator

At some point I stopped asking "how do I catch this faster?" and started asking "why does this keep appearing?"

The answer was not in the code. It was in the architecture — or the absence of one.

W. Edwards Deming made the same point about manufacturing decades earlier 📐. Inspection at the end of the line, he argued, is too late — by the time the inspector sees a defect, the conditions that produced it are already baked into the process. Quality has to be designed in upstream. The same logic shows up in Toyota's poka-yoke (mistake-proofing): make the wrong thing physically impossible to do, and you stop needing to catch it.

Firmware is no different. A code review is an inspection station at the end of the line. The conditions that made the defect possible — implicit ownership, scattered state, ambiguous boundaries — were already baked in by the time the engineer wrote the line of code I was reviewing.

First principle. Review is a lagging indicator. Architecture is the leading one. Fix the leading indicator and the lagging one takes care of itself.

So we did the experiment. Same engineers, same deadline, different question.

Chapter 5. The Shift That Changed Everything

We stopped reviewing for defects and started reviewing for structure 💡.

Does this module have a clear ownership boundary? Does this state machine live in one place? Does this driver make its assumptions explicit? Is the right thing easier to write than the wrong thing? The questions in code review changed. The conversations changed.

And then something I did not expect happened. The same engineers, with the same deadline pressure, started writing categorically better code. Not slightly better. Different.

The class of bug that had been appearing every sprint stopped appearing. Not because we got better at catching it. Because the structure of the code no longer made it natural to write.

The review queue got shorter. The team's morale climbed. The dashboards finally started telling the truth.

First principle. Quality is not a property of the code. It is a property of the system the code lives in.

If you have ever been hired to "fix quality," here is the framework I would have given my younger self on day one.

Architecture-First Quality

Four questions to ask before a single line of new code is written 📐:

1. Does this module have a clear ownership boundary? Who owns what, where does state live, what is the interface? If the boundary is implicit, defects will eventually exploit it.

2. Does this state machine live in one place? State scattered across modules is state that drifts. One place per finite-state machine (FSM), one source of truth per piece of state.

3. Does this driver or component make its assumptions explicit? Implicit assumptions are the ones that break at 2 a.m. on the factory floor 🏭. Write them down, in code or in comments, rather than leaving them in the author's head.

4. Is the right thing easier to write than the wrong thing? If a developer has to remember to do X every time, X will eventually be forgotten. Make it structural — types, interfaces, boundaries that make the wrong path harder to take than the right one.

When the answer to all four is yes, the same team writes categorically better code. Without harder review. Without longer sprints. Without anyone working extra hours.
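One way the four questions can show up in code, as a rough sketch rather than the design used on that program. It is written in Python for brevity. The `SensorFsm` class, its states and events, and the 12-bit ADC range are all hypothetical. Python enforces the ownership boundary only by convention; in C or Rust an opaque type or the compiler would do that job.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    SAMPLING = auto()
    FAULT = auto()


class Event(Enum):
    START = auto()
    SAMPLE_OK = auto()
    SAMPLE_BAD = auto()
    RESET = auto()


# Question 2: every legal transition lives in this one table.
TRANSITIONS = {
    (State.IDLE, Event.START): State.SAMPLING,
    (State.SAMPLING, Event.SAMPLE_OK): State.SAMPLING,
    (State.SAMPLING, Event.SAMPLE_BAD): State.FAULT,
    (State.FAULT, Event.RESET): State.IDLE,
}


class SensorFsm:
    """Question 1: the single owner of the sensor state.
    Callers can read `state` but have no setter to assign it."""

    # Question 3: the assumed valid raw range, stated in code.
    # A 12-bit ADC is a hypothetical choice for this sketch.
    RAW_MIN = 0
    RAW_MAX = 4095

    def __init__(self):
        self._state = State.IDLE

    @property
    def state(self):
        return self._state

    def dispatch(self, event):
        """Apply an event; anything not in TRANSITIONS is rejected loudly."""
        key = (self._state, event)
        if key not in TRANSITIONS:
            raise ValueError(
                f"illegal transition: {event.name} in {self._state.name}"
            )
        self._state = TRANSITIONS[key]
        return self._state

    def on_sample(self, raw):
        """Question 4: callers hand over the raw reading, so the null and
        range checks happen here and cannot be forgotten at a call site."""
        if raw is None or not (self.RAW_MIN <= raw <= self.RAW_MAX):
            return self.dispatch(Event.SAMPLE_BAD)
        return self.dispatch(Event.SAMPLE_OK)


if __name__ == "__main__":
    fsm = SensorFsm()
    fsm.dispatch(Event.START)
    print(fsm.on_sample(1200).name)
    print(fsm.on_sample(None).name)
```

The missing null check from the earlier chapters has nowhere to go missing: there is one entry point for samples, and it checks. A reviewer looking at this module is reviewing one table and one boundary, not every call site.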

The Goal Is Not a Better Gate

You cannot review your way to quality. You can review your way to slightly fewer defects per sprint, at the cost of your own time and the team's morale 🎯. Or you can spend the same energy upstream — on architecture, on boundaries, on making the right thing easy to write — and let quality emerge structurally.

The gate caught the defects. The structure stopped writing them.

Next: the architecture decision most teams never make explicitly — and the bug pattern that keeps showing up because of it.



Sources:

  • W. Edwards Deming — Out of the Crisis (1986), on inspection vs. quality designed upstream
  • Toyota Production System — poka-yoke (mistake-proofing) principle


New to this labeling? Read the framework: "20+ Years of Ideas. Articulation Is the Craft."

— Ritesh | ritzylab.com