What this is.
Opinion + Experience + Fact (45% opinion · 40% experience · 10% fact · 5% fiction)
Written in collaboration with AI — I discuss, I do not outsource.
Chapter 1. The Cost We Treat As The Job
The last post ended on a promise: tables, contracts, observability — these compound. There is one cost they compound against, and it is the largest hidden cost in firmware: debugging 🪲.
This post is that cost, written down.
Every firmware team I have worked with treats debugging as the job. It is the part of the week the senior engineer expects to spend. It is the line in the schedule everyone quietly inflates. It is the answer when a manager asks why the next feature slipped. Across many years and many products, debugging has been a significant share of every project I have shipped — and most of that share is structural, not skill.
The team is strong. The engineers are sharp. The code passes review. And the calendar still bends around the same shape every quarter — a few good weeks of feature work, then a stretch where one bug sits across three engineers and the rest of the roadmap waits.
This is the debugging tax. Every team pays it. Most teams treat it as a property of the work itself. It is, in fact, a property of the architecture the team has chosen to live inside ⏳.
▸ First principle. Debugging is the largest hidden cost in firmware, and most of that cost is structural, not skill.
Chapter 2. Where The Hours Actually Live
The first time a team measures where its debugging hours actually go, the answer surprises everyone in the room 📊.
I have run this exercise on small teams and large ones. The shape is the same. The hours split into three buckets, and the proportions are nothing like what the team would have guessed.
The smallest bucket is fixing. Once the bug is understood, the fix is usually small — a few lines, sometimes a one-line guard, sometimes a careful rewrite of a state transition. The fix lands inside a day for almost every bug I have watched. This is the bucket the team's mental model treats as the whole job. It is the bucket that fits in everyone's head, because the fix is the part everyone takes pride in.
The next bucket up is instrumenting after the fact. The bug is happening somewhere, but the log does not say where. The engineer adds a printf. Reflashes. Waits. The bug reappears. Adds another printf. Reflashes. Waits. Each cycle is small, but the cycles compound into days. By the time the engineer can see the device's state at the moment of failure, a week has passed.
The largest bucket — by a wide margin — is reproducing the bug. The device works on the bench. The bug only shows up in the field, or only after running for hours, or only under a load the lab does not have. Half the engineer's week is waiting for the bug to surface again so the new instrumentation can catch it. This is the bucket the team's mental model treats as nothing — five minutes here, ten minutes there, all of it invisible because no one is sitting at a keyboard typing 💤.
The team that has not measured this assumes the fix is most of the work. The team that has measured this knows the fix is a rounding error compared to the time spent reproducing and instrumenting ⚖️.
▸ First principle. The fix is the smallest bucket. Reproducing the bug is the largest. The team that measures this for the first time stops debating tooling priorities for the rest of the quarter.
Chapter 3. A Sprint I Watched
A few years ago I sat next to a team chasing a field bug on a connected industrial sensor 🔧.
The product was solid. The team was three senior engineers and a strong tech lead. The bug was a once-a-week restart on a small subset of devices, with no obvious cause in the logs. The serial log on a restarted device showed the same line every time: WDT: reset. Watchdog timeout. Nothing else.
The sprint began on a Monday with the tech lead's calm sentence: "Let's get this one before Friday." The team estimated three days for the fix.
The first two days went into reproducing the bug. The team ran the same firmware on three devices in the lab. None of them reproduced. They pulled two affected devices from the field and ran them in the lab. One reproduced after eleven hours. The other ran for two weeks and stayed quiet.
Day three went into adding instrumentation. The team added a richer log around the watchdog kicker. Reflashed both devices. Day four was waiting. Day five was reading the new logs and discovering the instrumentation had not captured the specific moment of the timeout, because the watchdog reset cleared the log buffer before it was flushed. They redesigned the instrumentation to write to non-volatile storage and reflashed again. Day eight was waiting. Day ten was reading the second log and finding it.
The fix took ninety minutes. A semaphore was being held by a task that occasionally entered a long-running computation when a specific sensor reading aligned with a specific time-of-day window. The kicker task starved. The watchdog reset 🕯️.
Ten days from the team's best engineers. The fix took ninety minutes. The other nine and a half days were reproducing and instrumenting — the two largest buckets, doing what the architecture made the team do.
The retrospective that Friday was honest. The team's mental model going in had assumed the fix was the hard part. The actual experience was the opposite. The instrumentation that mattered was the one that already had to exist, before the bug was filed.
▸ First principle. The instrumentation that pays off is the one that already exists before the bug is filed. Adding it after the fact charges the team rent on every bug.
Chapter 4. The Cost Shape, On Paper
Here is the shape on paper. No specific numbers — just the relative size of each bucket, the way it tends to land across the products I have shipped.
WHERE A WEEK OF DEBUGGING ACTUALLY GOES
─────────────────────────────────────────────────────────
REPRODUCING THE BUG LARGEST BUCKET
████████████████████████████████████
waiting for the bug to surface again,
arranging the field-like conditions,
re-running the load
INSTRUMENTING AFTER THE FACT LARGE BUCKET
██████████████████████████
adding printfs, reflashing, watching,
iterating on the next round of context
the team did not have last time
FIXING (once understood) SMALLEST BUCKET
███████████
reading the captured context,
proposing a change, code-reviewing,
shipping the patch
─────────────────────────────────────────────────────────
Three things stand out when the team sees this for the first time.
The largest bucket is also the most invisible — reproducing the bug looks like waiting, and a calendar of "waiting" is hard to argue for budget against. The middle bucket is the one the team thinks is the whole job — adding instrumentation feels like real work, because the engineer is at the keyboard. The smallest bucket is the one everyone agrees is engineering — the fix, the review, the patch.
The shape stays roughly the same whether the bug took two weeks or two days. The proportions move a little. The smallest bucket stays the smallest 🎯.
▸ First principle. The visible buckets are the smallest. The largest bucket is invisible. The cost shape stays the same across team size and bug size.
Chapter 5. What Reclaims The Hours
The move that reclaims the hours is the same move the last few posts have been pointing at, said differently each time 📚.
Observability captured by default — every state transition, every message route, every fault context, every boot — turns the second-largest bucket (instrumenting after the fact) into something close to zero. The instrumentation already exists. The next bug is filed against a stream that already has the data the engineer needs.
The largest bucket — reproducing — shrinks at the same time, for a related reason. When the field device sends back a structured event stream rather than a single line of free-form serial output, the bug often does not need to be reproduced at all. The captured stream is the reproduction. The engineer reads the last 30 seconds of the device's behavior and proposes the fix without ever connecting to a unit in the lab 📡.
A structured application layer above the RTOS — the layer that owns the bus, the contracts, the FSM tables, and the event taxonomy — is what makes capture-by-default a property of the architecture rather than a per-product project. The team writes the event taxonomy once. Every product inherits it. The next product's debugging tax is paid in the architecture, not in the calendar.
This is not a tooling story. It is an architectural one. A team can adopt a fancy logging library and still pay the debugging tax in full if the events the library captures are unstructured strings the engineer has to add after the fact. The reverse is also true — a team with a small, plain-C event taxonomy written once at the application-layer boundary can cut the tax in half before the first bug is filed.
▸ First principle. Capture-by-default observability is what shrinks the two largest debugging buckets at once. It is an architectural property, not a tooling one.
Chapter 6. The Two Tax Brackets
There are two tax brackets, and it is worth naming both honestly so the conversation stays grounded 💰.
The small bracket. A device with a few hundred lines of firmware, one engineer, a bench cable always attached. The debugging tax exists, but it is small — the bug surfaces in seconds, the printf is added in minutes, the fix lands in hours. The architectural investment to reduce the tax further is not earned at this scale. The team should ship the product and move on.
The large bracket. A device with tens of thousands of lines of firmware, multiple engineers, a fleet in the field, a long product life, multiple variants planned. The debugging tax is the largest single cost on the team's calendar and most of the team has not seen it that way yet. The architectural investment to reduce the tax — capture-by-default observability, typed contracts, FSM tables — earns back inside the first quarter, every quarter after that, for the life of the product.
Most of the teams I work with are in the second bracket and assume they are in the first. The product was small at the start. The team was small. The bench was right there. The bracket moved while the team's habits stayed where they were 🚪.
The cost is paid in either case. The architecture decides whether the team pays it once, in design time, or pays it forever, in calendar time.
▸ First principle. Every team is in a tax bracket. The first bracket can stay where it is. The second bracket pays the tax in design time, or it pays it forever in calendar time.
Chapter 7. What I Would Measure This Quarter
If I were on a firmware team this Tuesday morning, here is the single measurement I would run this quarter 📌.
For the next ten bugs the team closes, fill in three numbers in the ticket. Reproduce hours — wall-clock time from "bug reported" to "bug seen on a device the team controls." Instrument hours — wall-clock time from "bug seen" to "the engineer has enough context to reason about the cause." Fix hours — wall-clock time from "cause understood" to "patch merged." Three numbers. Three minutes of overhead per ticket.
At the end of the quarter, sum each column. The team will see its actual tax shape for the first time. The shape will surprise people. The senior engineers will say "yes, that matches my gut." The newer engineers will say "I had no idea where my week was going." The tech lead will be able to point at the largest bucket and decide what to invest in next quarter.
The investment that follows from the measurement is almost always the same one — capture-by-default observability, written into the architecture above the RTOS, applied across the product line. The shape of the measurement makes the case the architectural argument does not need to make 🔎.
The cost is invisible until it is measured. The architecture is invisible until it is named. Both become visible in the same week, on the same team, with the same three numbers per ticket.
If your team closed ten bugs and wrote down these three numbers — which bucket would surprise you most when the sum landed?
Next: there is a category of firmware bug that only surfaces in the field — and the move that closes the gap is older and simpler than most teams reach for. Host simulation.
▸ First principle. The debugging tax stays invisible until it is measured. Three numbers per ticket is the cheapest way to make it visible — and the case for the architectural fix writes itself the moment the sum is on the table.
Labeled: Opinion + Experience + Fact (45% opinion · 40% experience · 10% fact · 5% fiction)
Sources:
- Previous post — Why the Best Firmware State Machines Live in Tables — the FSM tables that the captured events trace through
- Earlier post — From Printf to Observability: The Step That Changes Debugging — the bounded event taxonomy that powers capture-by-default
- Earlier post — The Layer Above Your RTOS Has a Name — capture-by-default observability as one of the five elements of the application architecture layer
(Written in collaboration with AI — I discuss, I do not outsource.)
New to this labeling? Read the framework → 20+ Years of Ideas. Articulation Is the Craft.
— Ritesh | ritzylab.com
#EmbeddedSystems #Firmware #Debugging #Observability #FirstPrinciples
Stay in the loop
New essays on embedded systems, firmware quality, and engineering craft. No noise.
Discussion
No comments yet. Be the first to share your thoughts.
Leave a comment