ChrisMarshallNY 19 hours ago

I love the idea, but this line:

> 1) no bug should take over 2 days

Is odd. It’s virtually impossible for me to estimate how long it will take to fix a bug, until the job is done.

That said, unless fixing a bug requires a significant refactor/rewrite, I can’t imagine spending more than a day on one.

Also, I tend to attack bugs by priority/severity, as opposed to difficulty.

Some of the most serious bugs are often quite easy to find.

Once I find the cause of a bug, the fix is usually just around the corner.

  • lkbm 5 minutes ago

    A big reason we did a "fix week" at my old job was to deal with all the simple, low-priority issues. Sure, there were high-severity bugs, but those would get prioritized during normal work, whereas fix week was there to prevent death by a thousand cuts: kinda trivial things that just accumulate and make the site look and feel janky.

    Some things turn out to be surprisingly complex, but you can very often know that the simple thing is simple.

  • muixoozie 9 hours ago

    I worked for a company that used MS SQL Server a lot, and we would run into a heisenbug every few months that would crash our self-hosted SQL Server cluster or leave it unresponsive. I'm not a database person, so I'm probably butchering the description here. From our POV, progress would stop and require manual intervention (on call). Back and forth went on with MS and our DBAs for YEARS, poring over logs or whatever they do. Honestly, I never thought it would be fixed. Then one time it happened, we caught all the data going into the commit, and realized it would 100% reproduce the crash: only if we restored the database to a specific state and replayed this specific commit would it crash MS SQL Server. NDAs were signed and I took a machete to our code base to create a minimal repro binary that could deserialize our data store and commit / crash MS SQL Server. Made a nice PowerShell script to wrap it and repro the issue fast, and guess what? Within a month they fixed it. Was never clear on what exactly the problem was on their end; I got buffer overflow vibes, but that's a guess.

    • DanielHB 5 hours ago

      I once ran into a bug where our server code would crash, but only on a specific version of the Linux kernel under a specific version of OpenJDK that our client had. It took a good 2 weeks of troubleshooting, because we couldn't change the target environment we were deploying on.

      At least it crashed at startup; if it had been random, it would have been hell.

    • newtwilly 7 hours ago

      Wow, that's pretty epic and satisfying

  • Aurornis 4 hours ago

    All of the buggy software projects I've been employed to work on have had some version of this rule.

    Usually it's implicit, rather than explicit: nobody tells you to limit work on bugs to 1-2 days, but if you spend an entire week debugging something difficult and don't accumulate any story points in Jira, a cadre of project managers, program managers, and other manager titles you didn't even know existed will descend upon you and ask why you're dragging the velocity down.

    Lesson learned: Next time, avoid the hard bugs and give up early if something isn't going to turn into story points for hidden charts that are viewed by more people than you ever thought.

    • bottlero_cket 2 hours ago

      "Lesson learned, just avoid the hard bugs"? I don't think that is feasible for most of us!

    • kccqzy 4 hours ago

      I hate this kind of management culture that misuses story points. Story points are supposed to take into account difficulty. So if you spend an entire week debugging a difficult bug, you should’ve accumulated about the same amount of story points as colleagues debugging ten easy bugs.

      • oldestofsports 3 hours ago

        Everyone has a different approach to story points, and everyone thinks their way is "the right way". In the end they just turn into an abstraction layer for man-hours.

    • aeternum 4 hours ago

      It's the right lesson, because the difficulty of a bug often depends on the dev. For example, it might take one dev weeks to figure out that a hang is due to a sleep(.001) call within asyncio, whereas another can identify it with a glance at the code.
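
      A minimal sketch of that asyncio failure mode (hypothetical code, not from the thread): a blocking time.sleep(0.001) in a polling loop never yields control, so every other task starves; the cooperative await is the one-line fix.

          import asyncio

          async def poller():
              while True:
                  # a blocking time.sleep(0.001) here would freeze the event
                  # loop: heartbeat() below would never run and the program
                  # would appear to hang
                  await asyncio.sleep(0.001)   # cooperative: others get a turn

          async def heartbeat():
              for _ in range(3):
                  print("alive")
                  await asyncio.sleep(0.1)

          async def main():
              task = asyncio.create_task(poller())
              await heartbeat()
              task.cancel()

          asyncio.run(main())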

      • 1718627440 2 hours ago

        Which is why they get paid different rates.

  • QuiEgo 9 hours ago

    As someone who works with hardware, hard-to-repro bugs can take months to track down. Your code, the compiler, or the hardware itself (which is often a complex ball of IP from dozens of manufacturers held together with a NoC) could all be the problem. The extra fun bugs are when problems in two or three of them combine in a perfect storm to make a mega bug that is impossible to reproduce in isolation.

    • QuiEgo 8 hours ago

      Random example: I once worked on a bug where you were not allowed to send zero-length packets (ZLPs) due to a known HW bug. Okay, fine, work around it in SW. Turns out there was a HW eviction timer that was supposed to be disabled. It was connected to a counter that counted sys clk ticks. Due to a SW bug it was not entirely disabled, so once every 2^32 ticks it would trigger an eviction, and if the queue happened to be empty, it would send a ZLP, which triggered the first bug (hard-hanging the system in a way that breaks the debugger). There were dozens of ways that could hard-hang the system; this was just one. Good luck debugging that in two days.

      • jeffreygoesto 8 hours ago

        We had one where data, interpreted as an address (a simple C typo, from before static analysis was common), fell into an unmapped memory region, and the PCI controller stalled waiting for a response, thereby also halting the internal debugging logic, and JTAG just stopped forever (PPC603 core). Each time you'd hit the bug, the debugger was thrown off.

  • kykat 19 hours ago

    Sometimes, a "bug" can be caused by nasty architecture with intertwined hacks. Particularly on games, where you can easily have event A that triggers B unless C is in X state...

    What I want to say is that I've seen what happens in a team with a history of quick fixes and inadequate architecture design to support the complex features. In that case, a proper bugfix could create significant rework and QA.

    • arkh 15 hours ago

      > Sometimes, a "bug" can be caused by nasty architecture with intertwined hacks

      The joys of enterprise software. When searching for the cause of a bug leads you to discover multiple "forgotten" servers, ETL jobs, and crons, all interacting together. And no one knows why they do what they do, or how, because the people behind them went away many years ago.

      • fransje26 13 hours ago

        > searching for the cause of a bug let you discover multiple "forgotten" servers, ETL jobs, crons all interacting together. And no one knows why they do [..]

        And then comes the "beginner's" mistake. They don't seem to be doing anything. Let's remove them, what could possibly go wrong?

        • HelloNurse 11 hours ago

          If you follow the prescribed procedure and involve all required management, it stops being a beginner's mistake; and given reasonable rollback provisions, it stops being a mistake at all. If nobody knows what the thing is, it cannot be very important, and a removal attempt is the most effective and cost-efficient way to find out whether the thing can be removed.

          • Retric 9 hours ago

            > a removal attempt is the most effective and cost efficient way to find out whether the ting can be removed

            Cost efficient for your team's budget, sure, but a 1% chance of a 10+ million dollar issue is worth significant effort. That's the thing with enterprise systems: the scale of minor blips can justify quite a bit. If 1 person working for 3 months could figure out what something is doing, there are scales where that's a perfectly reasonable thing to do.

            Enterprise covers a whole range of situations: there are a lot more billion-dollar orgs than trillion-dollar orgs, so your mileage may vary.

            • HelloNurse 7 hours ago

              If there is a risk of a 10+ million dollar issue there is also some manager whose job is to overreact when they hear the announcement that someone wants to eliminate thing X, because they know that thing X is a useful part of the systems they are responsible for.

              In a reasonable organization only very minor systems can be undocumented enough to fall through the cracks.

              • Retric 5 hours ago

                In an ideal world sure, but knowledge gets lost every time someone randomly quits, dies, retires etc.

                Stuff that’s been working fine for years is easy for a team to forget about, especially when it’s a hidden dependency in some script that’s going to make some process quietly fail.

          • amalcon 7 hours ago

            I have had several things over the course of my career that:

            1) I was (temporarily) the only one still at the company who knew why it was there

            2) I only knew myself because I had reverse engineered it, because the person who put it there had left the company

            Now, some of those things had indeed become unnecessary over time (and thus were removed). Some of them, however, have been important (and thus were documented). In aggregate, it's been well worth the effort to do that reverse engineering to classify things properly.

        • Mtinie 8 hours ago

          If it’s done in a controlled manner with the ability to revert quickly, you’ve just instituted a “scream test[0].”

          ____

          [0] https://open.substack.com/pub/lunduke/p/the-scream-test

          (Obviously not the first description of the technique as you’ll read, but I like it as a clear example of how it works)

        • notTooFarGone 12 hours ago

          I've fixed more than enough bugs by just removing the code and doing it the right way.

          Of course you can get lost on the way but worst case is you learn the architecture.

        • xnorswap 9 hours ago

          The next mistake is thinking that completely re-writing the system will clean out the cruft.

        • fragmede 12 hours ago

          that's a management/cultural problem. If no one knows why it's there, the right answer is to remove it and see what breaks. If you're too afraid to do anything, for nebulous cultural reasons, you're paralyzed by fear and no one's operating with any efficiency. It hits different when it's the senior expert that everyone reveres, who invented everything the company depends on, that does it, vs a summer intern, vs Elon Musk bought your company (Twitter). Hate the man for doing it messily and ungraciously, but you can't argue with the fact that it gets results.

          • ljm 12 hours ago

            This does depend on a certain level of testing (automated or otherwise) for you to even be able to identify what breaks in the first place. The effect might be indirect several times over and you don't see what has changed until it lands in front of a customer and they notice it right away.

            Move fast and break things is also a managerial/cultural problem in certain contexts.

          • mschuster91 10 hours ago

            > It hits different when it's the senior expert that everyone revere's that invented everything the company depends on that does it, vs a summer intern vs Elon Musk bought your company (Twitter). Hate the man for doing it messily and ungraciously, but you can't argue with the fact that it gets results.

            You can only say that with a straight face if you're not the one responsible for cleaning up after Musk or whatever CTO sharted across the chess board.

            C-levels love the "shut it down and wait until someone cries" method because it gives easy results on some arbitrary KPI metric without exposing them to the actual fallout. In the worst case the loss is catastrophic, requiring weeks' worth of ad-hoc emergency-mode cleanup across multiple teams: say, something in finance depends on that server producing a report at the end of the year, and the C-level exec's decision was made in January... but by then, if you're in real bad luck, the physical hardware got sold off and the backup retention has expired. And when someone tries to blame the C-level exec, said exec will defend themselves with "we gave X months of advance warning AND 10 months after the fact no one had complained".

            • faidit 10 hours ago

              It can also be dangerous to be the person who blames execs. Other execs might see you as a snake who doesn't play the game, and start treating you as a problem child who needs to go, your actual contributions to the business be damned. Even if you have the clout to piss off powerful people, you can make an enemy for life there, who will be waiting for an opportunity to blame you for something, or use their influence to deny raises and resources to your team.

              Also with enterprise software a simple bug can do massive damage to clients and endanger large contracts. That's often a good reason to follow the Chesterton's fence rule.

      • silvestrov 14 hours ago

        plus report servers and others that run on obsolete versions of Windows/unix/IBM OS plus obsolete software versions.

        and you just look at this and think: one day, all of this is going to crash and it will never, ever boot again.

      • groestl 11 hours ago

        And then it turns out the bug is actually very intentional behavior.

    • ChrisMarshallNY 19 hours ago

      In that case, maybe having bug fixing be a two-step process (identify, then fix), might be sensible.

      • OhMeadhbh 18 hours ago

        I do this frequently. But sometimes identifying and/or fixing takes more than 2 days.

        But you hit on a point that seems to come up a lot. When a user story takes longer than the allotted points, I encourage my junior engineers to split it into two bugs. Exactly like what you say... one bug (or issue or story) describing what you did to typify the problem, and another with a suggestion for what to do to fix it.

        There doesn't seem to be a lot of industry best practice about how to manage this, so we just do whatever seems best to communicate to other teams (and to ourselves later in time after we've forgotten about the bug) what happened and why.

        Bug fix times are probably a Pareto distribution. The overwhelming majority will be identifiable within a fixed time box, but not all. So in addition to saying "no bug should take more than 2 days," I would add "if the bug takes more than 2 days, you really need to tell someone; something's going on." And one of the things I work VERY HARD to create is a sense of psychological safety, so devs know they're not going to lose their bonus if they randomly picked a bug that was much more wicked than anyone thought.
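
        A quick simulation of that intuition (a sketch; the tail index and the one-hour minimum are assumed numbers, not data): under a Pareto distribution, almost all fixes land inside a two-day box, but the tail never quite empties.

            import random

            random.seed(0)
            ALPHA = 1.16   # assumed tail index (roughly the "80/20" Pareto)
            MIN_H = 1.0    # assumed minimum fix time: one hour
            BOX = 16.0     # a two-day box of focused hours

            times = [MIN_H * random.paretovariate(ALPHA) for _ in range(100_000)]
            within = sum(t <= BOX for t in times) / len(times)
            print(f"{within:.0%} of simulated fixes fit in the box")   # ~96%
            # the remaining few percent are the wicked ones worth escalating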

        • ljm 7 hours ago

          I like to do this as a two-step triage because one aspect is the impact seen by the user and how many it reaches, but the other is how much effort it would take to fix and how risky that is.

          Knowing all of those aspects, and where an issue lands, makes it possible to prioritise it properly, but it also gives the developer the opportunity to hone their investigation and debugging skills without the pressure to solve it at the same time. A good write-up is great for knowledge sharing.

        • ChrisMarshallNY 17 hours ago

          You sound like a great team leader.

          Wish there were more like you, out there.

  • marginalia_nu 13 hours ago

    I think in general, bugs go unfixed in two scenarios:

    1. The cause isn't immediately obvious. In this case, finding the problem is usually 90% of the work, and it can't be known beforehand how long that will take; though I don't think bailing because it's taking too long is a good idea. If anything, it's in those really deep rabbit holes that the real gremlins hide.

    2. The cause is immediately obvious, but is an architecture mistake, the fix is a shit-ton of work, breaks workflows, requires involving stakeholders, etc. Even in this case it can be hard to say how long it will take, especially if other people are involved and have to sign off on decisions.

    I suppose it can also happen in low-trust sweatshops where developers are held on such a tight leash that they aren't able to fix trivial bugs they find without first going through a bunch of Jira rigmarole, which is sort of low-key the vibe I got from the post.

  • OhMeadhbh 18 hours ago

    At Amazon we had a bug that was the result of a compiler bug and the behaviour of Intel cores being mis-documented. It was intermittent, and related to one core occasionally being allowed to access stale data in the cache. We debugged it with a logic analyzer, the commented nginx source, and a copy of the C++11 spec.

    It took longer than 2 days to fix.

    • ChrisMarshallNY 17 hours ago

      I’m old enough to have used ICEs to trace program execution.

      They were damn cool. I seriously doubt that something like that exists outside of a TSMC or Intel lab, these days.

      • buildbot 34 minutes ago

        They float around on ebay! Software might be an issue.

      • plq 17 hours ago

        ICE meaning in-circuit emulator in this instance, I assume?

        • ChrisMarshallNY 13 hours ago

          Yeah. Guess it’s kind of a loaded acronym, these days.

      • Windchaser 7 hours ago

        /imagining using an internal combustion engine here

    • amoss 16 hours ago

      When you work on compilers, all bugs are compiler bugs.

      (apart from the ones in the firmware, and the hardware glitches...)

    • auguzanellato 17 hours ago

      What kind of LA did you use to debug an Intel core?

      • OhMeadhbh 16 hours ago

        The hardware team had some semi-custom thing from Intel that spat out (no surprise) gigabytes of trace data per second. I remember much of the pain was in constructing a lab where we could drive a test system at realistic loads to get the buggy behavior to emerge. It was intermittent, so it took us a couple of weeks to come up with theories, another couple of days of testing, and a week of analysis before we came up with triggers that allowed us to capture the data that showed the bug. It was a bit of a production.

  • cvoss 17 minutes ago

    The article addresses your concerns directly.

    > In one of our early fixits, someone picked up what looked like a straightforward bug. It should have been a few hours, maybe half a day. But it turned into a rabbit hole. Dependencies on other systems, unexpected edge cases, code that hadn’t been touched in years.

    > They spent the entire fixit week on it. And then the entire week after fixit trying to finish it. What started as a bug fix turned into a mini project. The work was valuable! But they missed the whole point of a fixit. No closing bugs throughout the week. No momentum. No dopamine hits from shipping fixes. Just one long slog.

    > That’s why we have the 2-day hard limit now. If something is ballooning, cut your losses. File a proper bug, move it to the backlog, pick something else. The limit isn’t about the work being worthless - it’s about keeping fixit feeling like fixit.

  • oldestofsports 3 hours ago

    > It’s virtually impossible for me to estimate how long it will take to fix a bug, until the job is done.

    I understood it as the whole point of the 2-day hard limit: you start working on a bug that turns out to be bigger than expected, so you write down your findings and move on to the next one.

  • PaulKeeble 19 hours ago

    Sometimes you find the cause of the bug in 5 minutes because it's precisely where you thought it was. Sometimes it's not there, and you end up writing some extra logging to hopefully expose its cause in production after the next release, because you can't reproduce it as it's transient. I don't know how to predict how long a bug will take to reproduce and track down, and only once it's understood do we know how long it takes to fix.

  • khannn 12 hours ago

    I had a job that required estimation on bug tickets. It's honestly amazing how they didn't realize that I'd take my actual estimate, then multiply it by 4, then use the extra time to work on my other bug tickets that the 4x multiplier wasn't good enough for.

    • mewpmewp2 12 hours ago

      That's just you hedging, and they don't really need to know that. As long as you are hedging accurately in the big picture, that's all that matters. They need estimates to be able to make decisions on what should be done and what not.

      You could tell them there's a 25% chance it takes 2 hours or less, a 50% chance it takes 4 hours or less, a 75% chance it takes 8 hours or less, and a 99% chance it takes 16 hours or less, to be accurate; but communication-wise you'll win out if you just intuitively call items like those 10 hours or so. Intuitively, 10 hours feels safe given those probabilities (which are intuitive, experience-based guesses too). So you'd probably say 10 hours, unless something really unexpected (the 1%) happens.

      Btw, in reality, with the above probabilities the actual average would be 5h-6h, with 1% of tasks potentially failing; but even your intuitive probability estimates could be off, so you likely want to say 10h.
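
      A back-of-envelope check of that 5h-6h figure (a sketch; the bucket midpoints are assumptions layered on the quoted percentiles):

          # midpoint approximation over the quoted quantiles
          buckets = [
              (0.25, 1.0),   # 25% chance: 2h or less, call it ~1h
              (0.25, 3.0),   # next 25%: 2-4h
              (0.25, 6.0),   # next 25%: 4-8h
              (0.24, 12.0),  # next 24%: 8-16h
              (0.01, 20.0),  # the unlucky 1%: call it ~20h
          ]
          expected = sum(p * mid for p, mid in buckets)
          print(f"expected ~ {expected:.1f}h")   # ~5.6h, vs the 10h "safe" call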

      But anyhow, that's also why story points are mostly used: if you say hours, they will naturally treat it as a more fixed estimate. Hours would be fine if everyone naturally understood that they imply a statistical average of time, plus a reasonable buffer, taken over a large number of similar tasks.

    • georgemcbay 7 hours ago

      Are you sure they didn't realize it...?

      Virtually everywhere I've ever worked has had an unwritten but widely understood informal policy of placing a multiple on predicted effort for both new code/features and bug fixing to account for Hofstadter's law.

  • brightball 10 hours ago

    In my experience, the vast majority of bugs are quick fixes that are easy to isolate or potentially even have a stack trace associated with them.

    There will always be those “only happens on the 3rd Tuesday every 6 months” issues that are more complicated but…if you can get all the small stuff out of the way it’s much easier to dedicate some time to the more complicated ones.

    Maximizing the value of time is the real key to focusing on quicker fixes. If nobody can make a case why one is more important than another, then the best use of your time is the fastest fix.

  • sshine 13 hours ago

    > unless fixing a bug requires a significant refactor/rewrite, I can’t imagine spending more than a day on one

    Race conditions in 3rd-party services during (or affected by) very long builds, with poor metrics and almost no documentation. They only show up sometimes, and you have to wait for them to reoccur. Add to this a domain you're not familiar with, and your ability to debug needs to be established first.

    Stack two or three of these on top of each other and you have days of figuring out what’s going on, mostly waiting for builds, speculating how to improve debug output.

    After resolving, don’t write any integration tests that might catch regressions, because you already spent enough time fixing it, and this needs to get replaced soon anyway (timeline: unknown).

  • ZaoLahma 13 hours ago

    > That said, unless fixing a bug requires a significant refactor/rewrite, I can’t imagine spending more than a day on one.

    The longer I work as a software engineer, the rarer it is that I get to work with bugs that take only a day to fix.

    • ChrisMarshallNY 13 hours ago

      I've found the opposite to be true, in my case.

      • ZaoLahma 11 hours ago

        For me the longer I work, the worse the bugs I work with become.

        Nowadays, after some 17 years in the business, it's pretty much always intermittently and rarely occurring race conditions of different flavors. They might result in different behaviors (crashes, missing or wrong data, ...), but at the core of it, it's almost always race conditions.

        The easy and quick to fix bugs never end up with me.

        • lll-o-lll 11 hours ago

          Yep. Non-determinism. Back in the day it was memory corruption caused by some race condition. By the time things have gone pop, you’re too far from the proximate cause to have useful logs or dumps.

          “Happens only once every 100k runs? Won’t fix”. That works until it doesn’t, then they come looking for the poor bastard that never fixes a bug in 2 days.

          • ChrisMarshallNY 11 hours ago

            My first job was as an RF (microwave) bench technician. My initial schooling was at a trade school for electronic technicians.

            It was all about fixing bugs; often, terrifying ones.

            That background came in handy, once I got into software.

            • lll-o-lll 11 hours ago

              I started life as an engineer. Try reverse-engineering why an electrical device your company designed (industrial setting, so big power) occasionally, and I mean really, really rarely, just explodes, burying its cover housing halfway through the opposite wall.

              Won’t fix doesn’t get accepted so well. Trying to work out what the hell happened from the charred remains isn’t so easy either.

        • ChrisMarshallNY 11 hours ago

          The reward for good work, is more work.

          I tend to mostly work alone, these days (Chief Cook & Bottle-Washer).

          All bugs are mine.

  • chii 19 hours ago

    I find most bugs take less time to fix than they take to verify and reproduce.

    • wahnfrieden 19 hours ago

      LLMs have helped me here the most. Adding copious detailed logging across the app on demand, then inspecting the logs to figure out the bug and even how to reproduce it.

      • bluGill 8 hours ago

        I did that once: logging ended up taking 80% of the CPU, leaving not enough for everything else the system should do. Now I am more careful to figure out what is worth logging at all, and also to make sure disabled logs are quickly bypassed.
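
        One way to make disabled logs nearly free (a sketch using Python's stdlib logging; the pipeline names are made up): gate expensive argument-building behind an explicit level check, and lean on %-style lazy formatting.

            import logging

            logging.basicConfig(level=logging.INFO)
            log = logging.getLogger("pipeline")

            def expensive_state_dump() -> str:
                # stands in for work you don't want to pay for when DEBUG is off
                return ",".join(str(i) for i in range(10_000))

            # explicit early-out: the dump is never built unless DEBUG is enabled
            if log.isEnabledFor(logging.DEBUG):
                log.debug("state dump: %s", expensive_state_dump())

            # %-style args are lazy too: the string is only built if emitted
            log.debug("row %d of %d", 42, 10_000)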

        • dylan604 5 hours ago

          we've gotten into adding verbosity levels in logging, where each logged event comes with an assigned level and only makes it into the log if it meets the requested log level. there are times when full verbose output is just too damn much for day-to-day debugging, but it's helpful when debugging the one feature.

          i used to think options like -vvv or -loglevel panic were just someone being funny, but they do work when necessary. -loglevel sane, -loglevel unsane, -loglevel insane would be my take, but am aware that most people would roll their eyes, so we're lame and use ERROR, WARNING, INFO, VERBOSE
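
          One common way to wire up stacked -v flags, for what it's worth (a sketch with made-up defaults, not your setup):

              import argparse, logging

              parser = argparse.ArgumentParser()
              parser.add_argument("-v", "--verbose", action="count", default=0)
              args = parser.parse_args()

              # default WARNING, -v for INFO, -vv (or more) for the DEBUG firehose
              level = {0: logging.WARNING, 1: logging.INFO}.get(args.verbose, logging.DEBUG)
              logging.basicConfig(level=level)
              logging.getLogger(__name__).info("only shown with -v or more")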

          • bluGill 4 hours ago

            On smaller projects that works. We have a complex system where individual log statements can have their level changed, though this turns out to be too fine-grained. I'm moving to every subsystem being controllable, but not the individual logs. I'm still not sure what the right answer is, though: it always seems like there are 10,000 lines of unrelated, useless logs to wade through before finding the useful one, but any time I remove something, it turns out to be the needed log for the very next bug report...

            • 1718627440 an hour ago

              Use something like syslog, where everything is recorded and you can filter on display by subsystem and loglevel.

          • wahnfrieden 4 hours ago

            That's great when you have to maintain a large amount of logs for weeks, months, years.

            But I'm talking about adding and removing logs per dev task. There's really no need to have sophisticated log levels and maintaining them as the app evolves and grows, because the LLM can "instantly" add and remove the logging it needs per granular task. This is much faster for me than maintaining logs and carefully selecting log levels and managing how logs can be filtered. That only made sense to me when it took actual dev effort to add or remove these logs.

        • wahnfrieden 4 hours ago

          You misunderstand: I remove the logging as soon as the task is done. I definitely do not keep the LLM logging around.

          That's the beauty of it - it's able to add and remove huge amounts of logging per task, so I never need to manage the scale and complexity of logging that outlasts the task it was purposefully added for. With typical development, adding logging takes time so we keep it around and maintain it.

          • bluGill 3 hours ago

            One of my needs is that when something breaks in the real world, I can figure out why. Bugs that happen at my desk I handle as you said: add the logs I need, then delete them when it is fixed. However, often there are things that I can't figure out how to reproduce at my desk, and so I need logs that are always running, on the off chance a new bug happens that I need to debug.

            • wahnfrieden 3 hours ago

              Yea that's valid. I do keep some kinds of logs around for this. But I'm selective with it and most logs I don't need to retain to manage this risk.

      • ChrisMarshallNY 17 hours ago

        Yes. I often just copy the whole core dump, and feed it into the prompt.

        • criddell 8 hours ago

          This is something that I've been trying to improve at. I work on a Windows application and so I get crash dumps that I open with WinDbg and then I usually start looking for exceptions.

          Is this something an LLM could help with? What exactly do you mean when you say you feed a dump to the prompt?

          • ChrisMarshallNY 7 hours ago

            I literally copy the whole stack dump from the log, and paste it into the LLM (I find that ChatGPT does a better job than Claude), along with something along the lines of:

            > I am getting occasional crashes on my iOS 17 or above UIKit program. Given the following stack trace, what problem do you think it might be?

            I will attach the source file, if I think I know the general area, along with any symptoms and steps to reproduce. One of the nice things about an LLM, is that it's difficult to overwhelm with too much information (unlike people).

            It will usually respond with a fairly detailed analysis. Usually, it has some good ideas to use as starting points.

            I don't think "I have a bug. Please fix it." would work, though. It's likely to try, but caveat emptor.

        • Lionga 16 hours ago

          And this kids is how one bug got fixed and two more were created

          • Sohcahtoa82 5 hours ago

            There's a huge difference between using an LLM to assist you versus letting it just do all the work for you. Your implication that they're the same, and that the previous commenter let the LLM do the work, is lazy.

            ChrisMarshallNY only said they fed the dump into the LLM. They said nothing about using the LLM to write the fix.

          • ChrisMarshallNY 13 hours ago

            Nope.

            Good result == LLM + Experience.

            The LLM just reduces the overhead.

            That’s really what every “new paradigm” has ever done.

            • enraged_camel 9 hours ago

              Also, robust test coverage helps prevent regressions.

  • beberlei 13 hours ago

    It's odd at first, but it springs from economic principles, mainly the sunk cost fallacy.

    If you invest 2 days of work and did not find the root cause of a bug, you have the human desire to keep investing more work, because you have already invested so much. At that point, however, it's best to re-evaluate and do something different instead, because it might have a bigger impact.

    The likelihood that, after 2 days of not finding the problem, you won't find it after another 2 days either is higher than for a fresh bug, where on average you find the problem sooner.
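
    A tiny simulation of that claim (a sketch; the Pareto shape and the quarter-day minimum are assumed numbers): with a heavy-tailed fix-time distribution, a bug that has already eaten 2 days is far more likely to eat 2 more than a fresh bug is to run long at all.

        import random

        random.seed(1)
        ALPHA, MIN_DAYS = 1.16, 0.25   # assumed tail index and minimum fix time
        draws = [MIN_DAYS * random.paretovariate(ALPHA) for _ in range(200_000)]

        fresh = sum(t > 2 for t in draws) / len(draws)
        stuck = [t for t in draws if t > 2]              # bugs already 2 days in
        still = sum(t > 4 for t in stuck) / len(stuck)   # ...that run 2 MORE days
        print(f"fresh bug runs past 2 days: {fresh:.0%}")   # ~9%
        print(f"stuck bug runs 2 more days: {still:.0%}")   # ~45%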

    • lan321 10 hours ago

      This sounds incorrect. You didn't find it but you're gaining domain knowledge and excluding options, hopefully narrowing down the cause. It's not like you're just chucking random garbage at Jenkins.

      Of course, if it's a difficult bug and you can just say 'fuck it' and bury it in the backlog forever that's fine, but in my experience the very complex ones don't get discovered or worked on at all unless it's absolutely critical or a customer complains.

  • pjc50 13 hours ago

    I think the worst case I encountered was something like two years from first customer report to even fully confirming the bug, followed by about a month of increasingly detailed investigations, a robot, and an oscilloscope.

    The initial description? "Touchscreen sometimes misses button presses".

    • ChrisMarshallNY 12 hours ago

      Thanks.

      I love hearing stories like this.

      • pjc50 12 hours ago

        I'm no Raymond Chen, but sometimes I wish I'd kept notes on interesting bugs that I could take with me when I moved jobs. I've often been the go-to guy for weird shit that is happening that nobody else understands and requires cross-disciplinary insight.

        Other favourites include "Microsoft Structured Exception Handling sometimes doesn't catch segfaults", and "any two of these network devices work together but all three combined freak out".

  • claw-el 5 hours ago

    > Also, I tend to attack bugs by priority/severity, as opposed to difficulty.

    This is one part that is rarely properly implemented. We have our bug bash days too, but I noticed after the fact that maybe 1/3 of the bugs we solved were on a feature we are thinking of deprecating soon due to low usage.

    How can we attack bugs better by priority?

  • peepee1982 8 hours ago

    Yep. Also, sometimes you figure out a bug and in the process you find a whole bunch of new ones that the first bug just never let surface.

  • JJMcJ 7 hours ago

    It's like remodeling. The drywall comes down. Do you just put up a new sheet or do you need to reframe one wall of the house?

  • thfuran 7 hours ago

    >I can’t imagine spending more than a day on one.

    You mean starting after it has been properly tracked down? It can often take a whole lot of time to go from "this behavior is incorrect sometimes" to "and here's what needs to change".

    • ChrisMarshallNY 7 hours ago

      Depends. If it takes a long time to track down, then it should either be sidelined, or the design needs to be revisited.

      I have found that really deep bugs are the result of bad design, on my part, and applying "band-aid" fixes often just kicks the can down the road, for a reckoning (that is now just a bit worse), later.

      If it is not super-serious (small performance issues, for instance; which can involve moving a lot of cheese), I can often schedule a design review for a time when it's less critical, and maybe set up an exploration branch.

      People keep bringing up threading and race conditions, which are legitimately nasty bugs.

      In my experience, they are often the result of bad design, on my part. It's been my experience that "thread everything" can be a recipe for disaster. The OS/SDK will often do internal threading, and I can actually make things worse, by running my own threads.

      I try to design stuff that will work fine, in any thread, which gives me the option to sequester it into a new thread, at a later time (I just did exactly that, a few days ago, in a Watch app), but don't immediately do that.

      • bagacrap 6 hours ago

        > If it takes a long time to track down, then it should either be sidelined, or the design needs to be revisited.

        I don't get this. Either you give up on the bug after a day, or you throw out the entire codebase and start over?

        Sure, if the bug is low severity and I don't have a reproduction, I will ignore it. But there are bad bugs that are not understood and can take a lot more than a day to look into, such as by adding telemetry to help track it down.

        Yes, it is usually the case that tracking it down is harder than fixing. But there are also cases where the larger system makes some broad assumptions which are not true, and fixing is tricky. It is not usually an option to throw out the entire system and start over each time this happens in a project.

        • ChrisMarshallNY 6 hours ago

          > you throw out the entire codebase and start over

          Nah. That’s called “catastrophic thinking.” This is why it’s important (in my experience) to back off, and calm down.

          I’ll usually find a way to manage a smaller part of the codebase.

          If I make decisions when I’m stressed, Bad Things Happen.

  • michaelbuckbee 6 hours ago

    Something I often find is "categorical" bugs, where it's really 3 or 4 different bugs in a trench coat, all presenting as a single issue.

  • huherto 8 hours ago

    I do agree that you should be able to fix most bugs in 2 days or less. If you have many bugs taking longer to fix, it may be an indication of systemic issues (e.g. design, architecture, tooling, environment access, test infrastructure, etc.).

    • bluGill 8 hours ago

      Sure, but you never know whether the next bug is another fix-it-in-1-hour, or will take months to figure out. I have had a few "'The' is not spelled 'Teh'" bugs where it takes longer to find the code in question with grep than to fix it, but most are not that obvious, and so you don't know if there are 2 hours left or not until 2 hours later, when you know whether you found something or are still looking (or unless you think you fixed it and verifying the fix takes about 2 hours, but then only if your fix worked).

  • AbstractH24 8 hours ago

    > Is odd. It’s virtually impossible for me to estimate how long it will take to fix a bug, until the job is done.

    Learning how to better estimate how long tasks take is one of my biggest goals, and one I've yet to figure out how to master.

  • Uehreka 18 hours ago

    > It’s virtually impossible for me to estimate how long it will take to fix a bug, until the job is done.

    In my experience there are two types of low-priority bugs (high-priority bugs just have to be fixed immediately no matter how easy or hard they are).

    1. The kind where I facepalm and go “yup, I know exactly what that is”, though sometimes it’s too low of a priority to do it right now, and it ends up sitting on the backlog forever. This is the kind of bug the author wants to sweep for, they can often be wiped out in big batches by temporarily making bug-hunting the priority every once in a while.

    2. The kind where I go “Hmm, that’s weird, that really shouldn’t happen.” These can be easy and turn into a facepalm after an hour of searching, or they can turn out to be brain-broiling heisenbugs that eat up tons of time, and it’s difficult to figure out which. If you wipe out a ton of category 1 bugs then trying to sift through this category for easy wins can be a good use of time.

    And yeah, sometimes a category 1 bug turns out to be category 2, but that’s pretty unusual. This is definitely an area where the perfect is the enemy of the good, and I find this mental model to be pretty good.

    • tonyedgecombe 14 hours ago

      >high-priority bugs just have to be fixed immediately no matter how easy or hard they are

      The fact that something is high priority doesn't make it less work.

      • ChrisMarshallNY 11 hours ago

        Or more.

        I often find the nastiest bugs are the quickest fixes.

        I have a "zero-crash" policy. Crashes are never acceptable.

        It's easy to enforce, because crashes are usually easy to find and fix.

        $> ThreadingProblems has entered the chat

  • dockd 6 hours ago

    How is this for a rule of thumb: the time it takes to fix a bug is directly related to the age of the software.

    • ChrisMarshallNY 4 hours ago

      That's also a "It Depends™" thing.

      Really old software can be referred to as "Mature," as opposed to "Decrepit." It can be extremely well-documented, and well-understood. Many times, there are tools that grow up, alongside the main code.

      I wrote stuff that was still in use 25 years later, because the folks that took it over did a really good job of maintaining it.

  • lapcat 19 hours ago

    > It’s virtually impossible for me to estimate how long it will take to fix a bug, until the job is done.

    This is explained later in the post. The 2 day hard limit is applied not to the estimate but rather to the actual work: "If something is ballooning, cut your losses. File a proper bug, move it to the backlog, pick something else."

    • ChrisMarshallNY 19 hours ago

      Most of the work in finding/fixing bugs is reproducing them reliably enough to determine the root cause.

      Once I find a bug, the fix is often negligible.

      But I can get into a rabbithole, tracking down the root cause. I don’t know if I’ve ever spent more than a day, trying to pin down a bug, but I have walked away from rabbitholes, a couple of times. I hate doing that. Leaves an unscratchable itch.

  • mobeigi 12 hours ago

    I believe the idea is to pick small items that you'd likely be able to solve quickly. You don't know for sure but you can usually take a good guess at which tasks are quick.

  • yxhuvud 9 hours ago

    I've seen people spend 4 months on a hard-to-replicate segfault.

  • jorvi 8 hours ago

    Yeah, "no bug should take over 2 days" tells me you've never had a race condition in your codebase.

    • ChrisMarshallNY 7 hours ago

      I'm sure that you're right. I'm likely a bad, inexperienced engineer. There's a lot of us, out here.

  • ahoka 13 hours ago

    Not sure why you would ever need to refactor to fix a bug?

    • nemetroid 2 hours ago

      A nice way to fix bugs is to make the buggy state impossible to represent. In cases where a bug was caused by some fundamental flaw in the original design, a redesign might be the only way to feel reasonably confident about the fix.
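
      A minimal sketch of that idea (hypothetical types; any language with sum types works): model the state as exactly one of two shapes, so "connected but no session" cannot even be constructed.

          from dataclasses import dataclass
          from typing import Union

          @dataclass
          class Disconnected:
              retry_at: float      # when to try again

          @dataclass
          class Connected:
              session_id: str      # a session exists only while connected

          ConnState = Union[Disconnected, Connected]   # one shape at a time

          def describe(state: ConnState) -> str:
              if isinstance(state, Connected):
                  return f"up (session {state.session_id})"
              return f"down, retrying at {state.retry_at}"

          print(describe(Connected(session_id="abc123")))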

    • ChrisMarshallNY 12 hours ago

      Oh, that's because a bug in requirements or specification is usually a killer.

      I have encountered areas where the basic design was wrong (often comes from rushing in, before taking the time to think things through, all the way).

      In these cases, we can either kludge a patch, or go back and make sure the design is fixed.

      The longer I've been working, the less often I need to go back and fix a busted design.

  • mat0 15 hours ago

    you cannot know. that’s why the post elaborates saying (paraphrasing) “if you realize it’s taking longer, cut your losses and move on to something else”

  • w0m 9 hours ago

    > That said, unless fixing a bug requires a significant refactor/rewrite, I can’t imagine spending more than a day on one.

    oh sweet sweet summer child...

  • j45 18 hours ago

    Bugs taking less than 2 days are great to have as a target but will not be something that can be guaranteed.

    • RossBencina 17 hours ago

      Next up: a new programming language or methodology that guarantees all bugs take less than two days to fix.

      • wiredfool 3 hours ago

        I think this, like many problems, can be refactored into the halting problem. Which we know how to solve…. Right?

  • triyambakam 19 hours ago

    > It’s virtually impossible for me to estimate how long it will take to fix a bug, until the job is done.

    Now I find that odd.

    • gyomu 19 hours ago

      I don't. I worked on firmware stuff where unexplainable behavior occurs; digging around the code, you start to feel like it's going to take some serious work to even begin to comprehend the root cause, and suddenly you find the one line of code that sets the wrong byte somewhere as a side effect, and what you thought would fill up your week ends up taking 2 hours.

      And sometimes, the exact opposite happens.

    • kubb 11 hours ago

      You might get humbled by overwhelming complexity one day. Enjoy the illusion of perfect insight until then.

    • ChrisMarshallNY 19 hours ago

      Yeah, I’m obviously a terrible programmer. Ya got me.

      • triyambakam 18 hours ago

        I just find it so oversimplified that I can't believe you're sincere. Like you have entirely no internal heuristic for even a coarse estimation of a few minutes, hours, or days? I would say you're not being very introspective or are just exaggerating.

        • kimixa 18 hours ago

          I think it's very sector dependent.

          Working on drivers, a relatively recent example is when we started looking at a "small" image corruption issue in some really specific cases, that slowly spidered out to what was fundamentally a hardware bug affecting an entire class of possible situations, it was just this one case happened to be noticed first.

          There was even talk about a hardware ECO at points during this, though an acceptable workaround was eventually found.

          I could never have predicted that when I started working on it, and it seemed every time we thought we'd got a decent idea about what was happening even more was revealed.

          And then there have been many other issues where you fall onto the cause pretty much instantly, and a trivial fix can be completed and in testing faster than updating the bug tracker with an estimate.

          True, there's probably a decent fraction, maybe even 50%, where you can have a decent guess after putting in some length of time and be correct within a factor of 2 or so, but I always felt the "long tail" was large enough to make that pretty damn inaccurate.

        • auggierose 12 hours ago

          I can explain it to you. A bug description, at the beginning, is some observed behaviour that seems to be wrong. Now the process of UNDERSTANDING the bug starts. Once that process has concluded, it will be possible to make a rough guess at how long fixing it will take. Very often the answer then is a minute or two, unless major rewrites are necessary. So the problem is that you cannot put an upfront bound on how long you need to understand the bug. Understanding can be a long-winded process that includes trying to fix the bug along the way.

          • darkwater 12 hours ago

            > A bug description at the beginning is some observed behaviour that seems to be wrong.

            Or not. A bug description can also be a ticket from a fellow engineer who knows the problem space deeply and has an initial understanding of the bug, its likely cause, and possible problems. As always, it depends, and IME the kind of bugs that end up in those "bugathons" are the annoying "yeah, I know about it, we need to fix it at some point because it's a PITA" ones.

            • auggierose 11 hours ago

              That just means that somebody else has already started the process of understanding the bug, without finishing it. So what?

              • darkwater 11 hours ago

                So you can know, before starting to work on the ticket, whether it's a boring few-minutes job, whether it could take hours or days, or whether it's going to be something bigger.

                I can understand the "I don't do estimates" mantra for bigger projects, but ballpark estimations for bugs - even if you can be wrong in the end - should not be labelled as 100% impossible all the times.

                • auggierose 10 hours ago

                  Why did the other developer who passed you the bug not make an estimate then?

                  I understand the urge to quantify something that is impossible to quantify beforehand. There is nothing wrong with making a guess, but people who don't understand my argument usually also don't understand the meaning of "guess". A guess is something based on my current understanding, and as that may change substantially, my guess may also change substantially.

                  I can make a guess right now on any bug I will ever encounter, based on my past experience: It will not take me more than a day to fix it. Happy?

        • com2kid 18 hours ago

          My team once encountered a bug that was due to a supplier misstating the delay timing needed for a memory chip.

          The timings we had in place worked for most chips, but they failed for a small % of chips in the field. The failure was always exactly identical, the same memory address got corrupted, so it looked exactly like an invalid pointer access.

          It took multiple engineers months of investigating to finally track down the root cause.

          • triyambakam 17 hours ago

            But what was the original estimate? And even so I'm not saying it must be completely and always correct. I'm saying it seems wild to have no starting point, to simply give up.

            • com2kid 17 hours ago

              Have you ever fixed random memory corruption in an OS without memory protection?

              Best case you trap on memory access to an address if your debugger supports it (ours didn't). Worst case you go through every pointer that is known to access nearby memory and go over the code very very carefully.

              Of course it doesn't have to be a nearby pointer, it can be any pointer anywhere in the code base causing the problem, you just hope it is a nearby pointer because the alternative is a needle in a haystack.

              I forget how we did find the root cause. I think someone may have just guessed a bit flip in a pointer (vs an overrun), then un-bit-flipped every one of the possible bits one by one (not that many; with only a few MB of memory there aren't many active bits in a pointer), looked at what was nearby (figuring out what the originally intended address of the pointer was), and started investigating which pointer it was originally supposed to be.
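
              The enumeration is small enough to brute-force; a sketch of the idea (the pointer value is made up):

                  def single_bit_flips(addr: int, width: int = 32):
                      # yield every value differing from addr in exactly one bit
                      for bit in range(width):
                          yield addr ^ (1 << bit)

                  bad_ptr = 0x2001F044   # hypothetical corrupted pointer from a dump
                  for candidate in single_bit_flips(bad_ptr):
                      print(hex(candidate))   # see which land inside known objects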

              Then after confirming it was a bit flip you have to figure out why the hell a subset of your devices are reliably seeing the exact same bit flipped, once every few days.

              So to answer your question, you get a bug (memory is being corrupted), you do an initial investigation, and then provide an estimate. That estimate can very well be "no way to tell".

              The principal engineer on this particular project (Microsoft Band) had a strict zero-user-impacting-bugs rule. Accordingly, after one of my guys spent a couple of weeks investigating, the principal engineer assigned one of the top firmware engineers in the world to track down this one bug and fix it. It took over a month.

              • snovv_crash 15 hours ago

                This is why a test suite and a mock application running on the host are so important. Tools like valgrind can be used to validate that you won't have any memory errors once you deploy to the platform that doesn't have protections against invalid accesses.

                It wouldn't have caught your issue in this case. But it would have eliminated a huge part of the search space your embedded engineers had to explore while hunting down the bug.

                • com2kid 6 hours ago

                  Custom OS, cross-compiling from Windows, using Arm's old C compiler, so tools like valgrind weren't available to us.

                  Since it was embedded, no malloc. Everything being static allocations made the search possible in the first place.

                  This wasn't the only HW bug we found, ugh.

            • pyrale 15 hours ago

              There is a divide in this job between people who can always provide an estimate but accept that it is sometimes wrong, and people who would prefer not to give an estimate because they know it’s more guess than analysis.

              You seem to be in the first club, and the other poster in the second.

        • arethuza 14 hours ago

          It rather depends on the environment in which you are working. If estimates are, well, estimates, then there is probably little harm in guessing how long something might take to fix. However, some places treat "estimates" as binding commitments, and then it can be risky to make any kind of guess, because someone will hold you to it.

          • ChrisMarshallNY 10 hours ago

            More than some places. Every place I've worked has been a place where you estimate at your own peril. Even when the manager says "Don't worry. I won't hold you to it. Just give me a ballpark.", you are screwed.

            I used to work for a Japanese company. When we'd have review meetings, each manager would have a small notebook on the table, in front of them.

            Whenever a date was mentioned, they'd quickly write something down.

            Those dates were never forgotten.

            • arethuza 9 hours ago

              "Don't worry. I won't hold you to it. Just give me a ballpark."

              Anytime someone says that you absolutely know they will treat whatever you say as being a commitment written in blood!

etamponi 12 hours ago

Ex-Meta employee here. I worked at Reality Labs; perhaps in other orgs the situation is different.

At Meta we did "fix-it weeks", more or less every quarter. At the beginning I was thrilled: leadership that actually cares about fixing bugs!

Then reality hit: it's the worst possible decision for code and software quality. Basically this turned into: you are allowed to land all the crap you want, and then you have one week to "fix all the bugs". Guess what: most of the time we couldn't fix even a single bug, because we were drowning in tech debt.

  • sudoit 6 minutes ago

    My experience at Meta is that those weeks are spent not fixing bugs but working on refactors which increase your LOC and diff count. Many of the tasks engs work on are ones they themselves created months previously as "todos".

    I question my life a lot when I'm reviewing code which appears to have been written incorrectly at first so that the author can land a follow-up diff with the "fix".

  • mentos 10 hours ago

    Reminds me of id's policy of "As soon as you see a bug, you fix it"

    "...if you don't fix your bugs your new code will be built on buggy code and ensure an unstable foundation and if you check in buggy code someone else is going to be writing code based on your bad code and well you know you can imagine how wasteful that's going to be"

    16:22 of "The Early Days of id Software: Programming Principles" by John Romero (Strange Loop 2022) https://www.youtube.com/watch?v=IzqdZAYcwfY&t=982s

    • AdamN 9 hours ago

      Yeah, Joel Spolsky is adamant about this with the "Bugs First" approach and he claims most of the delays and garbage that Microsoft released during the early years of his career were centered on that one rule being violated.

      • Wololooo 6 hours ago

        The problem is, even if you make a note to fix it later: one, you never get back to it, and two, it drives decisions for things around it, until it breaks...

    • Aurornis 4 hours ago

    > Reminds me of id's policy of "As soon as you see a bug, you fix it"

      If you'll allow me to project a lot of lived experience on to this story: A policy of fixing bugs immediately sounds like a policy software developers would come up with. A policy of deferring bug fixes to a neatly scheduled week on the calendar for bug fixes sounds like a policy some project managers would brainstorm as a way to keep velocity numbers high and get their tickets closed on schedule.

  • demaga 12 hours ago

    From the post:

    > That’s not to say we don’t fix important bugs during regular work; we absolutely do. But fixits recognize that there should be a place for handling the “this is slightly annoying but never quite urgent enough” class of problems.

    So in their case, fixit week is mostly about smaller bugs, quality of life improvements and developer experience.

    • gregoriol 11 hours ago

      It must be part of the normal process. If the normal process leaves things like this to "some other time", one should start by fixing the process.

      • IgorPartola 9 hours ago

        Say you are working on a banking system. You ship a login form, it is deployed, used by tons of people. Six months later you are mid-sprint on the final leg of a project that will hook your bank into the new FedNow system. There are dozens of departments working together to coordinate deploying this new setup as large amounts of money will be moved through it. You are elbows deep in the context of your part of this and the system cannot go live without it. Twice a day you are getting QA feedback and need to make prompt updates to your code so the work doesn’t stall.

        This is when the report comes in that your login form update from six months ago does not work on mobile Opera if you disable JavaScript. The fix isn’t obvious and will require research, potentially many hours or even days of testing and since it is a login form you will need the QA team to test it after you find another developer on your team to do a code review for you.

        What exactly would you do in this case? Pull resources from a major project that has the full attention of the C suite to accommodate some tin foil Luddite a few weeks sooner or classify this as lower priority?

        • 2arrs2ells 8 hours ago

          This is a great example... except I think the right answer to "what exactly would you do in this case?" doesn't support your argument.

          I'd document that mobile Opera with Javascript disabled is an unsupported config, and ask a team to make a help center doc asking mobile Opera users to enable JS.

          • IgorPartola 4 hours ago

            That is also a solution. But the part where you drop everything to immediately document this, and then involve someone else on the team to write more documentation is the exact constraint I was trying to demonstrate. This bug is out of your and your team’s current context. It is low priority. A workaround reply is appropriate here and may have already been sent to the customer by tech support but it is also entirely appropriate to wait a few weeks to complete even what you stated if it is going to affect the company’s bottom line to do it sooner.

          • CharlieDigital 8 hours ago

            This is too logical, practical, and pragmatic. Which product owner/project manager would approve such a thing!?

            Being able to think of simple, practical solutions like this is one of the hardest skills to develop as a team, IMO. Not everything needs to be perfect and not everything needs a product-level fix. Sometimes a "here's the workaround" is good enough, and if enough people complain or your metrics show user friction in some journey, then prioritize the fix.

            GP's example is so niche that it isn't worth fixing without evidence that the impact is significant.

        • BurningFrog 5 hours ago

          Two thoughts:

          - This bug genuinely sounds like low priority.

          - This organization seems to operate assuming unforeseen problems will never pop up. That is unwise.

          • IgorPartola 4 hours ago

            Yes exactly. Any non-critical, out-of-current-scope bug must be evaluated for whether it should interrupt the current work. Is it a priority? You cannot automate this process by saying "if is_bug: return {priority: IMMEDIATE}" as suggested by the quote about id above, because you will absolutely destroy any velocity. In fact, that quote seems to me to be talking about not committing new code with known bugs, not dropping everything as soon as a non-critical bug is discovered in old code.

            Instead you need to have a triage process and a planning process, which to some degree most software teams do. The problem is that most of these processes do not have a rigid way of dealing with really old low priority bugs. A bug fix week is one option for addressing that need.
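
            A minimal sketch of the distinction in Python (the fields and thresholds here are hypothetical, purely to illustrate the triage idea):

                from dataclasses import dataclass

                @dataclass
                class Bug:
                    severity: int        # 1 = critical ... 4 = cosmetic
                    users_affected: int
                    age_days: int        # how long it has sat in the backlog

                def naive_policy(bug: Bug) -> str:
                    # "if is_bug: return {priority: IMMEDIATE}", applied literally:
                    # every report interrupts whatever you are doing right now.
                    return "interrupt current work"

                def triage(bug: Bug) -> str:
                    # A triage process weighs severity and impact; a fixit week
                    # sweeps up whatever has aged out of the normal queue.
                    if bug.severity == 1:
                        return "interrupt current work"
                    if bug.severity == 2 or bug.users_affected > 1000:
                        return "schedule next sprint"
                    if bug.age_days > 180:
                        return "fixit week candidate"
                    return "backlog"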

            • BurningFrog 2 hours ago

              Your argument is only true if you have an infinite number of bugs.

              If you only have a reasonable number of bugs, and fix them as you find them, it's just how you do work.

              It may sound impossible, but I did work like this for two decades, and it worked well for those teams.

              • IgorPartola an hour ago

                No. My argument is valid if you have deadlines and your resources are not infinite. Either you were the only one reporting bugs, in which case of course you could fix them as you found them because they were always in your work context, or you had no deadlines and could afford to switch context without the inefficiency of it affecting anything.

                In most situations you have users who also find bugs and report them when they want, not when you are ready for them.

                You can even see that your argument does not apply generally from the fact that bugs persist in software for years. If your way were both more efficient AND more aligned with human nature, everyone would be working like this; but clearly almost nobody can afford to drop everything to fix a user's random low-priority bug the minute it is reported.

      • tmoertel 4 hours ago

        > If the normal process leaves things like this to "some other time", one should start by fixing the process.

        Adding regular fixits is how they fix the normal process.

        This addition recognizes that most bug fixes do not move the metrics that are used to prioritize work toward business goals. But maintaining software quality, and in particular preventing technical debt from piling up, are essential to any long-running software development process. If quality goes to crap, the project will eventually die. It will bog down under its own weight, and soon no good developer will want to work on it. And then the software project will never again be able to meet business goals.

        So: the normal process now includes fixits, dedicated time to focus on fixing things that have no direct contribution to business goals but do, over the long term, determine whether any business goals get met.

  • inerte 5 hours ago

    I agree completely. It also gives a mental excuse to not fix bugs now and leave them for the upcoming bug fix week, especially if there's any kind of celebration of what was achieved during bug fix week.

    It's also patronizing to the devs. "Internal survey shows devs complain about software quality; let's give them a week every quarter and the other 11 we do whatever we want." What needs to change here is leadership being honest about the business, because sometimes fixing bugs is simply not important. Sure, sure, it depends on the bug... I am talking about when devs complain about having a huge number of bugs in the backlog (most of them low impact), or about something that only affects a small percentage of users. Another strategy here would be to properly surface the impact of said bugs to users/customers... until you do this, nobody has a reason to care.

  • emodendroket 4 hours ago

    In practice, if you really care about fixing bugs/cleaning things up, the thing that works best is sneaking that into feature work somehow.

pmontra 14 hours ago

About stopping and fixing problems, has anybody had this kind of experience?

1. Working on Feature A, stopped by management or by the customer because we need Feature B as soon as possible.

2. Working on Feature B, stopped because there is Emergency C in production due to something that you warned the customer about months ago but there was no time to stop, analyze and fix.

3. Deployed a workaround and created issue D to fix it properly.

4. Postponed issue D because the workaround is deemed to be enough, resumed Feature B.

5. Stopped Feature B again because either Emergency E or new higher priority Feature F. At this point you can't remember what that original Feature A was about and you get a feeling that you're about to forget Feature B too.

6. Working on whatever the new thing is, you are interrupted by Emergency G that happened because that workaround at step 3 was only a workaround, as you correctly assessed, but again, no time to implement the proper fix D so you hack a new workaround.

Maybe add another couple of iterations, but by this point every party is angry, or at least unhappy, with every other party.

You have a feeling that the work of the last two or three months on every single feature has been wasted because you could not deliver any one of them. That means that the customer wasted the money they paid you. Their problem, but it can't be good for their business so your problem too.

The current state of the production system is "buggy and full of workarounds" and it's going to get worse. So you think that the customer would have been wiser to pause and fix all the nastier bugs before starting Feature A. We could have had a system running smoothly, no emergencies, and everybody happier. But no, so one starts thinking that maybe the best course of action is changing company or customer.

  • cracki 9 hours ago

    Symptoms of a dysfunctional company where communication has broken down, everyone with any authority is running around EXACTLY like a headless chicken, waving around frantically (giving orders). Margins are probably thin as a razor, or non-existent. They will micromanage your work time to death. You will be treated as a commodity factory machine and if you start using your brain to solve actual problems, you will be chastised. Deadlines everywhere keep everyone's brain shut off and in panic mode. No time to properly engineer anything. Nobody has the time to check anyone else's work, causing "trust" that isn't even blind, just foolish. You as the software guy end up debugging and fixing EVERYONE's mistakes. When the bug is in hardware/electronics, everyone knows who's actually to blame, but everyone still expects YOU to fix it, and they're immensely disappointed when you can't save the day.

    These places cannot and will not change. If you can, find employment elsewhere.

  • jamil7 6 hours ago

    Yes, it's a leadership failure and probably time to go; in my experience, it only gets worse. It's a vicious cycle where, as velocity slows, inexperienced leadership gets more and more panicked and starts frantically rearranging projects, features, and people in desperate attempts to fix the problem, obviously exacerbating the communication breakdown and gridlock further. It also builds resentment and can turn pretty toxic as everyone starts just looking out for themselves.

  • pjc50 13 hours ago

    This is not uncommon but I've mostly managed to avoid it, because it's a management failure. There is a delicate process of "managing the customer" so that they get a result they will eventually be satisfied with, rather than just saying yes to whatever the last phone call was.

  • dsego 12 hours ago

    Yes. It's usually not worth spending too much time on proper engineering if the company is still trying to find product-market fit and you will be working on something else or deleting the code in a few months.

  • abroszka33 13 hours ago

    > has anybody had this kind of experience?

    Yes. The issue is not you; it's a toxic workplace. Leave as soon as you can.

BurningFrog 19 hours ago

This is weird to me...

The way I learned the trade, and usually worked, is that bug fixing always comes first!

You don't work on new features until the old ones work as they should.

This worked well for the teams I was on. Having an (AFAYK) bug-free code base is incredibly useful!!

  • Celeo 19 hours ago

    Depending on the size of the team/org/company, working on anything other than the next feature is a hard sell to PM/PO/PgM/management.

    • NegativeK 19 hours ago

      I've had to inform leadership that stability is a feature, just like anything else, and that you can't just expect it to happen without giving it time.

      One leader kind of listened. Sort of. I'm pretty sure I was lucky.

      • deaux 12 hours ago

        Ask them if they're into pro sports. If so (and most men outside of tech are in some way), they'll probably know the phrase "availability is the best ability".

        • Herring 5 hours ago

          Or just look at your car. Heated seats are sexy in the short term, but boring old reliability and predictability win out long term.

      • dijksterhuis 8 hours ago

        i got lucky at my last shop. b2b place for like 2x other customer companies. eng manager person (who was also like 3x other managers :/ ) let everything get super broken and unstable.

        when i took lead of eng it was quite an easy path to making it clear stability was critical. slow everything down and actually do QA. customer became super happy because basically 3x releases went out with minimal bugs/tweaks required. “users don’t want broken changes immediately, they want working changes every so often” was my spiel etc etc.

        unfortunately it was impossible to convince people about that until they screwed it all up. i still struggle to let things “get bad so they can get good”, but am aware of the lesson today at least.

        tl;dr sometimes you gotta let people break things so badly that they become open to another way

        • machomaster 7 hours ago

          It's interesting how misaligned your effort is.

          You put effort into writing an unnecessary tldr on a short post, but couldn't be bothered to properly Capitalize your sentences to ensure readability.

          Weird.

          • nsingh2 6 hours ago

            > Be kind. Don't be snarky. Edit out swipes [1]

            [1] https://news.ycombinator.com/newsguidelines.html

            • machomaster 4 hours ago

              "Please don't post shallow dismissals"

              Same source.

              Don't trivialize my useful feedback.

              If a person tries to communicate but his stylistic choice of laziness (his own admission!) gets in the way of delivering his message, it is very tangibly useful to point that out, so that the writing effort can be better optimized for effect.

              I wasn't even demanding/telling him what to do. I simply shared my observation, but it's up to him to decide if he wants to communicate better. Information and understanding is power.

          • dijksterhuis 5 hours ago

            > couldn't be bothered to properly Capitalize your sentences

            i changed my iphone settings to not auto-capitalise words

            i put effort into my ostensible laziness

    • BurningFrog 19 hours ago

      That's what I hear.

      I've had some mix of luck and skill in finding these jobs. Working with people you've worked with before helps with knowing what you're in for.

      I also don't really ask anyone, I just fix any bugs I find. That may not work in all organizations :)

      • ramon156 13 hours ago

        I can guarantee you this doesn't work in our team! You didn't make a ticket, so the PM has no idea what you're doing!

        Yes, a ticket takes 2 seconds. It also puts me off my focus :P But I guess measuring is more important than achieving.

      • zelphirkalt 14 hours ago

        micro-managing middle manager: "Are all your other sprint tasks finished?"

        code reviewing coworker: "This shouldn't be done on this branch!" (OK, at least this is easy to fix by doing it on a separate branch.)

  • jaredklewis 17 hours ago

    Where have you worked where this was practiced if you don’t mind sharing?

    I’ve seen very close to bug free backends (more early on in development). But every frontend code base ever just always seems to have a long list of low impact bugs. Weird devices, a11y things, unanticipated screen widths, weird iOS safari quirks and so on.

    Also I feel like if this was official policy, many managers would then just start classifying whatever they wanted done as a bug (and the line can be somewhat blurry anyway). So curious if that was an issue that needed dealing with.

    • mavamaarten 16 hours ago

      I'm not going to share my employer, but this is exactly how we operate. Bugs first, they show up on the Jira board at the top of the list. If managers would abuse that (they don't), we'd just convert them to stories, lol.

      I do agree that it's rare, this is my first workplace where they actually work like that.

    • zelphirkalt 13 hours ago

      Frontend bugs mostly stem from the use of overblown frontend frameworks that try to abstract away the basics of the web too much. When relying on browser defaults and web standards, proper semantic HTML, and sane CSS usage, the scope of things that can go wrong is limited.

      • DanielHB 4 hours ago

        In my experience frontend bugs are usually from over-complicated business logic with layout-issues a distant second.

      • Sharlin 10 hours ago

        It's pretty wild that this is the case now (if it indeed is), given that for a long, long time, sticking to sane, standard stuff was the exact way you'd land in a glitch/compatibility hell. Yes, thanks mostly to IE, but still.

    • BurningFrog 7 hours ago

      I worked at various small agile startups around SF. Retired last year.

      They weren't big enough to have "official policies". We talked to each other instead.

      I did work at big companies twice, for a few years each. That taught me to appreciate the simple life :)

  • RHSeeger 18 hours ago

    Bugs have priorities associated with them, too. It's reasonable for a new feature to be more important than fixing a lower-priority bug. For example, if reading the second "page" of results for an API isn't working correctly, but nobody is actually using that functionality, then it might not be that important to fix it.
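
    As a concrete, entirely hypothetical flavor of that example: a classic off-by-one in offset arithmetic only bites when someone actually asks for page 2:

        def fetch_page(items, page, page_size=10):
            # Buggy: the stride is off by one, so page 1 looks fine but
            # page 2 re-serves the last item of page 1 and drifts from there.
            start = (page - 1) * (page_size - 1)  # should be (page - 1) * page_size
            return items[start:start + page_size]

        items = list(range(25))
        assert fetch_page(items, 1) == list(range(10))      # page 1: correct
        assert fetch_page(items, 2) != list(range(10, 20))  # page 2: wrong, and nobody notices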

    • tonyedgecombe 14 hours ago

      >For example, if reading the second "page" of results for an API isn't working correctly, but nobody is actually using that functionality, then it might not be that important to fix it.

      I've seen that very argument several times; it was even in the requirements on one occasion. In each instance it was incorrect: there were times when the second page was reached.

    • AdamN 9 hours ago

      IMHO the best way to deal with that situation is to mark the bug as wontfix. Better to have a policy of always fixing bugs but be more flexible on what counts as a bug (and making sure the list of them is very small and being actively worked on).

    • x0x0 6 hours ago

      I don't think, except for a direct regression, it's even possible to define a bug in a way that isn't the same as a feature request. They're identical: someone wants the software to do X, it doesn't do X, maybe we should make it do X. (Except, again, for it used to do X but now doesn't and that wasn't intentional.)

      Treating bugs as different than features and automatically pushing them to the front of the line likely leads to a non-parsimonious expenditure of effort and sets up some nasty fights with other parts of the company which will definitely figure out that something being a "bug" gets it prioritized. Obviously this can be done poorly, and why even have engineers if you aren't listening to their prioritization as well.

  • mobeigi 12 hours ago

    Any modern system with a sizeable userbase has thousands of bugs. Not all bugs are severe, some might be inconveniences at best affecting only a small % of customers. You have to usually balance feature work and bug fixes and leadership almost always favours new features if the bugs aren't critical to address.

  • jaredsohn 19 hours ago

    I'd love to see an actual bug-free codebase. People who state the codebase is bug-free probably just lack awareness. Even stating we 'have only x bugs' is likely not true.

    • NegativeK 19 hours ago

      Top commenter's "AFAYK" acronym is covering that.

      The type that claims they're going to achieve zero known and unknown bugs is also going to be the type to get mad at people for finding bugs.

      • supriyo-biswas 18 hours ago

        > The type that claims they're going to achieve zero known and unknown bugs is also going to be the type to get mad at people for finding bugs.

        This is usually EMs in my experience.

        At my last job, I remember reading a codebase that was recently written by another developer to implement something in another project, and found a thread safety issue. When I brought this up and how we’ll push this fix as part of the next release, he went on a little tirade about how proper processes weren’t being followed, etc. although it was a mistake anyone could have made.

    • rurban 18 hours ago

      We kinda always leave documentation and test bugs in. Documentation teams have different scheduling, and tests are nice TODOs.

      There are also always bugs detected after shipping (usually in beta), which need to be accounted for.

    • waste_monk 19 hours ago

      >I'd love to see an actual bug-free codebase.

      cat /dev/null .

      • Sharlin 10 hours ago

        A specific individual execution is not a codebase.

  • brulard 12 hours ago

    Many bugs have very low severity or appear only for a small minority of users under very specific conditions. Fixing these first might be quite a bad use of your capacity. Like misaligned UI elements, etc. Critical bugs should of course be fixed immediately, as a hotfix.

  • kykat 19 hours ago

    In the places that I worked, features came before all else, and bugs weren't fixed unless customers complained

  • thundergolfer 19 hours ago

    This is the 'Zero Defects'[1] mode of development. A Microsoft department adopted it in 1989 after their product quality dropped. (Ballmer is cc'd on the memo.)

    1. https://sriramk.com/memos/zerodef.pdf

    • waste_monk 19 hours ago

      As opposed to the current 100% defects approach they seem to have adopted.

  • Cthulhu_ 10 hours ago

    Thing is, if you follow a process like scrum, your product owner will set priorities; if there's a bug that isn't critical, it may go down the list of priorities compared to other issues.

    And there's other bugs that don't really have any measurable impact, or only affect a small percentage of people, etc.

  • ben0x539 19 hours ago

    In your experience, is there a lot of contention over whether a given issue counts as a bug fix or a feature/improvement? In the article, some of the examples were saving people a few clicks in a frequent process, or updating documentation. Naively, I expect that in an environment where bug fixes get infinite priority, those wouldn't count as bugs, so they would potentially stick around forever too.

    • BurningFrog 17 hours ago

      In my world, improving the UI to save clicks is a new feature, not a bug fix.

      Assuming it works as intended.

stevoski 15 hours ago

I’m a strong believer in “fix bugs first” - especially in the modern age of “always be deploying” web apps.

(I run a small SaaS product - a micro-SaaS as some call it.)

We’ll stop work on a new feature to fix a newly reported bug, even if it is a minor problem affecting just one person.

Once you have been following a "fix bugs first" approach for a while, the newly discovered bugs tend to be few, and straightforward to reproduce and fix.

This is not necessarily the best approach from a business perspective.

But from the perspective of being proud of what we do, of making high quality software, and treating our customers well, it is a great approach.

Oh, and customers love it when the bug they reported is fixed within hours or days.

  • ivolimmen 15 hours ago

    Would love to work on a project with this as a rule, but I am working on a project that was built before me: 1.2 million lines of code, 15 years old, really old frameworks. I don't think we could add features if we did this.

    • chamomeal 12 hours ago

      Same. The legacy project that powers all of our revenue-making projects at work is a gargantuan hulking php monster of the worst code I’ve ever seen.

      A lot of the internal behaviors ARE bugs that have been worked around, and become part of the arbitrary piles of logic that somehow serve customer needs. My own understanding of bugs in general has definitely changed.

dgunay 5 hours ago

An ex-employer of mine had a regular cycle:

    1. Build features at all costs
    2. Eventually a high profile client has a major issue during an event, costing them a ton of goodwill
    3. Leadership pauses everything and the company only works on bugfixes and tech debt for a week or two
I onboarded during step 3. I should have taken that as a warning that that's how the company operated. If your company doesn't make time for bugfixes and getting out of its own way, that culture is hard to change.

Galxeagle 19 hours ago

In my experience, having a fixit week on the calendar encourages teams to just defer what otherwise could be done relatively easily at first report ("ah, we'll get to it in fixit week"). Sometimes it's a PM justifying putting their feature ahead of product quality; other times it's because a dev thinks they're lining up work for an anticipated new hire's onboarding. It's even hinted at in the article ('All year round, we encourage everyone to tag bugs as “good fixit candidates” as they encounter them.')

My preferred approach is to explicitly plan 'keep the lights on' capacity into the quarter/sprint/etc., in much the same way that oncall/incident handling is budgeted for. With the right guidelines, it gives air cover for an engineer to justify spending the time to fix something right away, and it builds a culture of constantly making small tweaks.
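
A rough sketch of that budgeting, with a made-up 20% reservation (the fraction is an assumption; the point is that the capacity is planned rather than stolen from feature time):

    def plan_sprint(total_points, ktlo_fraction=0.20):
        # Reserve a fixed slice of each sprint for keep-the-lights-on work,
        # the same way oncall/incident handling is budgeted for.
        ktlo = round(total_points * ktlo_fraction)
        return {"feature_points": total_points - ktlo, "ktlo_points": ktlo}

    print(plan_sprint(40))  # {'feature_points': 32, 'ktlo_points': 8}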

That said, I totally resonate with the culture aspect - I think I'd just expand the scope of the week-long event to include enhancements and POCs like a quasi hackathon

codingdave an hour ago

> Then the week before fixit, each subteam goes through these bugs and sizes them:

I advocate never sizing/scoring bugs. Instead, if your process demands scores, call everything a 2, because over the course of all the bugs that will be your average: you'll knock out 10 small ones and then get stuck on a big one. Bug-fixing efforts should be more Kanban than Scrum. Prioritize the most important/damaging/whatever ones, do them in order, and keep doing them until they are done or you run out of time.
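
A minimal sketch of that flow (Python, field names hypothetical): no scoring, just pull in priority order until the time box is empty:

    def fixit_week(bugs, hours_available):
        # Kanban-style: prioritize by impact, work each bug to completion,
        # stop when the time runs out. Effort is only known afterwards.
        done = []
        for bug in sorted(bugs, key=lambda b: b["impact"], reverse=True):
            if hours_available <= 0:
                break
            hours_available -= bug["hours_it_actually_took"]
            done.append(bug["title"])
        return done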

a4isms 9 hours ago

In the early days of Hacker News, and maybe even before Hacker News when Reddit didn't have subreddits... OG blogger Joel Spolsky posited the "Joel Test," twelve simple yes/no questions that defined a certain reasonable-by-today's-standards local optimum for shipping software:

https://www.joelonsoftware.com/2000/08/09/the-joel-test-12-s...

Some seem ridiculously obvious today, but weren't standard 25 years ago. Seriously! At the turn of the century, not everyone used a bug database or ticket tracker. Lots of places had complicated builds to production, with error-prone manual steps.

But question five is still relevant today: Do you fix bugs before writing new code?

hastily3114 19 hours ago

We do this too sometimes and I love it. When I work on my own projects I always stop and refactor/fix problems before adding any new features. I wish companies would see the value in doing this

Also love the humble brag: "I've just closed my 12th bug" and later "12 was the maximum number of bugs closed by one person".

troad 17 hours ago

It's fairly telling of the state of the software industry that the exotic craft of 'fixing bugs' is apparently worth a LinkedIn-style self-promotional blog post.

I don't mean to be too harsh on the author. They mean well. But I am saddened by the wider context, where a dev posts 'we fix bugs occasionally' and everyone is thrilled, because the idea of ensuring software continues to work well over time is now as alien to software dev as the idea of fair dealing is to used car salesmen.

  • remus 17 hours ago

    > But I am saddened by the wider context, where a dev posts 'we fix bugs occasionally' and everyone is thrilled, because the idea of ensuring software continues to work well over time is now as alien to software dev as the idea of fair dealing is to used car salesmen

    This is not the vibe I got from the post at all. I am sure they fix plenty of bugs throughout the rest of the year, but that will be balanced with other work on new features and the like, and is going to be guided by wider business priorities. It seems the point of the exercise is focusing solely on bugs to the exclusion of everything else, with a lot of latitude to just pick whatever has been annoying you personally.

    • ozim 15 hours ago

      That’s what we have fix anything Friday for.

      The name is just an indication that you can do it any day, but the idea is that on Friday, when you're not about to start a big thing, you pick some small thing you want to fix personally. Maybe a bug in the product, maybe the local dev setup.

  • pjmlp 17 hours ago

    That is why I stand on the side of better laws for company responsibility.

    We as an industry have taught people that broken products are acceptable.

    In any other industry, unless people know from the start that they are getting something broken or low quality (flea market, 1 euro shop, or similar), they will return the product, ask for their money back, sue the company, whatever.

    • zelphirkalt 14 hours ago

      There should be better regulation of course, but I want to point out that the comparison with other industries doesn't quite work, because these days software is often given away at no financial cost. Often it costs one's data. But once that data is released into their data flows, you can never unrelease it. It has already been processed in LLM training or used somehow to target you with ads or whatever other purpose. So people can't do what they usually would do when the product is broken.

      • nananana9 12 hours ago

        "Free" software resulting in your data being sold is the software working as intended, it's orthogonal to the question of software robustness.

        Software isn't uniquely high stakes relative to other industries. Sure, if there's a data breach your data can't be un-leaked, but you can't be un-killed when a building collapses over your head or your car fails on the highway. The comparison with other industries works just fine - if we have high stakes, we should be shipping working products.

        • 1718627440 12 hours ago

          "Free Software" means something, please consider using the terms gratis software or freeware/shareware instead.

    • k4rli 14 hours ago

      Imagining that the software will be shipped with hardware that has no internet access, and therefore cumbersome firmware upgrades, might be helpful. Avoiding shipping critical bugs is actually critical, since bricking the hardware is undesirable.

      Example: (aftermarket) car headunit.

      • zeroCalories 12 hours ago

        This type of testing is incredibly expensive and you'll have a startup run circles around you, assuming a startup could even exist when the YC investment needs to stretch 4x as far for the same product.

        The real solution is to have individual software developers be licensed and personally liable for the damage their work does. Write horrible bugs? A licensing board will review your work. Make a calculated risk that damages someone? Company sued by the user, developer sued by the company. This correctly balances incentives between software quality and productivity, and has the added benefit of culling low quality workers.

        • hiAndrewQuinn 4 hours ago

          You don't need formal licensing for this to work, passthrough liability would do plenty. The real sign of success is whether an insurance industry sprouts up to protect software engineers, just like doctors.

        • pjmlp 11 hours ago

          This kind of relates to proper Engineering titles. Unfortunately, many countries don't have a legal system in place for those who decide to call themselves engineers without going through the exam and the related Order of the Engineer.

          • zeroCalories 8 hours ago

            I don't think titles are for anything besides establishing blame. If a company hires someone in a locale where the engineer can't be held responsible, the executives and major investors should be held liable. That way things will naturally sort themselves out. Need something unimportant done? Offshore it. Have some critical system? Hire someone who can take responsibility.

            • pjmlp 8 hours ago

              As we say back home, responsibility should never die alone.

              • zeroCalories 6 hours ago

                The easiest way to get away with murder is to split the blame such that no individual can be pointed to.

  • alansaber 12 hours ago

    A company creating the conditions that allow for high-quality engineering has always been the exception, not the norm.

lalitmaganti 10 hours ago

Author here! Really glad to have sparked a lively discussion in the comments. Since there are so many threads since I last looked at this post, I'm making one top-level comment to provide some thoughts:

1) I agree that estimating a bug's complexity upfront is an error-prone process. This is exactly why I say in the post that we encourage everyone to "feel out" non-trivial issues and, if it feels like the scope is expanding too much (after a few hours of investigation), to just pick something else after writing up their findings on the bug.

2) I use the word "bug" to refer to more traditional bugs ("X is wrong in product") but also feature requests ("I wish X feature worked differently"). This is just a companyism that maybe I should have called out in the post!

3) There's definitely a risk the fixit week turns into just "let's wait to fix bugs until that week". This is why our fixits are especially for small bugs which won't be fixed otherwise - it's not a replacement for technical hygiene (i.e. refactoring code, removing dead code, improving abstractions) nor a replacement for fixing big/important issues in a timely manner.

  • danielbarla 10 hours ago

    Very interesting post, thank you!

    I'd also be curious to know the following: how many new errors or regressions were caused by the bug fixes?

julianlam 20 hours ago

We did this ages ago at our company (back then we were making silly Facebook games, remember those?)

It was by far the most fun, productive, and fulfilling week.

It went on to shape the course of our development strategy when I started my own company. We regularly work on tech debt, and I actively applaud it when others do it too.

yujzgzc 6 hours ago

There are really two kinds of "small bugs".

1) Things that have existed in your product for decades and haven't been major strategic issues.

2) Things that arose recently in the wake of launches. This can be because it's hard to fix every corner case, or because of individuals throwing sloppy code over the wall to look like they "ship fast".

I try to hold the team to fixing bugs of type (2) quickly, while memory is fresh, as they point to unwanted regressions.

The bugs in (1) are more interesting. It's a bit sad that teams kinda have to "sneak that work in" with fixit weeks. I have known of products large enough to be able to A/B test the effects of a quarter's worth of "small fixes", and finding significant gains in key product metrics. That changed management's attitude with respect to "small fixes" - when you have a ton of them, they can produce meaningful impact worthy of strategic consideration, not just a week of giving the dev team free rein to scratch their itch.

xnx 20 hours ago

I've never understood why bugs get treated differently from new features. If there was a bug, the old feature was never completed. The time cost and benefits should be considered equally.

  • sb8244 20 hours ago

    If the bug affects 1 customer and the feature affects the rest, is the old feature complete?

    It's not binary.

    • zelphirkalt 12 hours ago

      Yet engineers are pushed to give unknowable estimates in points, and when things take "longer" (did you notice that shift right there?) they are either overdue or taking too long; saying "it takes as long as it takes" is not accepted by middle management.

      • sb8244 8 hours ago

        That's a strawman. It is not really related to the main point and I'm not sure of the point you're trying to make (maybe that tension exists?)

        Obviously things take as long as they take. I've always been an educator of this back to the business leadership. In my experience, most business people truly have no freaking clue how a product gets built and code gets shipped.

        Giving proactive updates (meaning not on the day it was expected to be done according to the last update) is an important part of a professional's working life. There's always a tension between business and engineers. Engineers just generally don't deal well with tension and try to minimize it, or complain about it.

        • Capricorn2481 3 hours ago

          I read them as agreeing with you, but pointing out that fixing bugs, at the end of the day, is up to the client.

          It's just a predictable dance. You say something will take this long, then you find a bug. You point it out to the client, and they get mad at you because your estimate was off. They try to pressure you into fixing the bug for free, whether or not you were even around when it was introduced.

          Eventually you just make a judgement call about bugs every time you run into them.

  • xboxnolifes 18 hours ago

    Because the goal of most businesses is not to create complete features. There are only actions in response to the repeated question of "which next action do we think will lead us to the most money?"

  • klodolph 20 hours ago

    Bugs can get introduced for other reasons besides “feature not completed”.

  • superxpro12 20 hours ago

    Until we develop a way for MBAs with spreadsheets to quantify profit/loss w.r.t. bugs, it will never be valued.

    • lapcat 19 hours ago

      The solution is to never hire an MBA.

      • pixl97 8 hours ago

        'Why are we getting bought out by a company that cut corners and hired MBAs, and then fired?'

stevage an hour ago

Surprising that no bug should take more than 2 days, yet most developers fixed only 4 bugs in 5 days.

captainkrtek 18 hours ago

A company I worked at also did this, though there were no limits. Some folks would choose to spend the whole week working on a larger refactor; for example, I unified all of our redis usage onto a single modern library, compared to the mess of 3 libraries of various ages across our codebase. This was relatively easy, but tedious, and required some new tests/etc.

Overall, I think this kind of thing is very positive for the health of building software, and for morale, showing that it is a priority to actually address these things.

jll29 15 hours ago

From the report, it sounds like a good thing, for the product and the team morale.

Strangely, the math works out such that they could hire nearly 1 FTE engineer to work full time only on "little issues" (40 weeks; given that people have vacations, public holidays, and sick time, that's a full year's work at 100%), and then the small issues could be addressed immediately, modulo the good vibes created by dedicating the whole group to one cause for one week. Of course nobody would approve that role...

  • joker99 14 hours ago

    The unkind world we live in would see this role being abused quickly and a person not lasting long in this role. For one, in the wrong team, it might lead to devs just doing 80% of the work and leaving the rest to the janitor. And the janitor might get fed up with having to fix the buggy code of their colleagues.

    I wonder if the janitor role could be rotated weekly or so? Then everyone could reap the benefits of this role too; I can imagine this being a good thing for anyone in terms of motivation. Fixing stuff triggers a different positive response than building stuff.

inhumantsar a day ago

I firmly believe that this sort of fixit week is as much of an anti-pattern as all-features-all-the-time. Ensuring engineers have the agency and the space to fix things and refactor as part of the normal process pays serious dividends in the long run.

eg: My last company's system was layer after layer built on top of the semi-technical founder's MVP. The total focus on features meant engineers worked solo most of the time and gave them few opportunities to coordinate and standardize. The result was a mess. Logic smeared across every layer, modules or microservices with overlapping responsibilities writing to the same tables and columns. Mass logging all at the error or info level. It was difficult to understand, harder to trace, and nearly every new feature started off with "well first we need to get out of this corner we find ourselves painted into".

When I compare that experience with some other environments I've been in where engineering had more autonomy at the day-to-day level, it's clear to me that this company should have been able to move at least as quickly with half the engineers if they were given the space to coordinate ahead of a new feature and occasionally take the time to refactor things that got spaghettified over time.

  • lalitmaganti a day ago

    As I pointed out in the "criticisms" section, I don't see fixit weeks as a replacement for good technical hygiene.

    To be clear, engineers have a lot of autonomy in my team to do what they want. People can and do fix things as they come up and are encouraged to refactor and pay down technical debt as part of their day to day work.

    It's more that even with this autonomy, fixit bugs are underappreciated by everyone, even engineers. Having a week where we can redress the balance does wonders.

bears123 3 hours ago

Teams that implement this or a similar exercise: how do you handle PR reviews for fixits, if at all? I'd like to implement it, but on a smaller team (8 devs, 3 of whom approve PRs) the volume would be so high that the 3 senior devs would likely spend all their time reviewing.

  • LocalPCGuy 3 hours ago

    In the spirit of that exercise, the fixes should not take an excessive amount of time to review. If they are, it's likely either the scope of the fix is too large for that kind of exercise, or the PR review process is too in-depth.

    I would also question why only 3 of 8 devs approve PRs. Even if that can't change more broadly all of the time, this kind of exercise seems like a perfect time to allow everyone to review PRs - two fold benefit, more fixes are reviewed and gives experience reviewing to others that don't get to do that regularly.

    So yes, definitely still do PRs, and if that is problematic, consider whether that is an indication the PR process may itself need to be reviewed.

knallfrosch 6 hours ago

It's concerning that no one was able to fix these bugs as an "aside".

With an average of 4 bugs fixed in 5 days and 150 bugs, we can assume 50 bugs with less than one day's effort were just lying around with no one daring to touch them.

internet101010 3 hours ago

Microsoft should do this for an entire year. Windows 11 is still a bug-ridden heap of trash.

Cthulhu_ 10 hours ago

We did a bug hunt once at a previous employer: just stop regular work, open the website, and look for issues. We found over a hundred in a day. Stopping your regular work and actively working with your product is a healthy practice. Facebook did (does?) a thing where once a week they'd throttle the internet so everyone had to experience what things are like for their average user.

dzonga 4 hours ago

I do wonder, though: if some of those Google products were not part of Google but independent companies, what would happen?

tracker1 8 hours ago

I've been pushing for things like this for years...

Having every 3rd or 4th sprint be for dev initiatives and bugs... Or having a long/short sprint cycle where short sprints are mostly for bugs... Basically every 3rd week is for meetings and bug work, so you get a solid 2 weeks with reduced meetings.

It's hard to convince upper managers of the utility though.

caycep 5 hours ago

Granted, I feel like fixing bugs should be adequately pre-allocated on the road map, vs. spending 1 giant cycle every 10 years catching up on bug fixes a la Snow Leopard (cough cough, Apple).

jchrisa 19 hours ago

I just had a majorly fun time addressing tech debt, deleting about 15k lines-of-code from a codebase that now has ~45k lines of implementation, and 50k lines of tests. This was made possible by moving from a homegrown auth system to Clerk, as well as consolidating some Cloudflare workers, and other basic stuff. Not as fun as creating the tech debt in the first place, but much more satisfying. Open source repo if you like to read this sort of thing: https://github.com/VibesDIY/vibes.diy/pull/582

  • wredcoll 19 hours ago

    I would be weirdly happy to have a role whose entire job was literally just deleting code. It is extremely satisfying.

klabetron 13 hours ago

I introduced this to my old company years ago and called it Big Block of Cheese Day, after the West Wing episode [1]. We mostly focused on very minor bugs that affected a tiny bit of our user base in edgy edge cases but littered our error logs. (This was years ago at a, back then, relatively immature tech company.)

It had the same spirit as a hackathon.

[1] https://westwing.fandom.com/wiki/Big_Block_of_Cheese_Day

  • philipallstar 12 hours ago

    When the world learned the wrong reason for why the Mercator Projection was adopted.

eviks 15 hours ago

> closed a feature request from 2021!

> It’s a classic fixit issue: a small improvement that never bubbled to the priority list. It took me one day to implement. One day for something that sat there for four years

> The benefits of fixits

> For the product: craftsmanship and care

sorry, but this is not care when the priority system is so broken that it requires a full suspension, and only once a quarter at that

> A hallmark of any good product is attention to detail:

That's precisely the issue: taking 4 years to bring attention to detail, and only outside the main priority system.

Now, don't get me wrong: a fixit is better than nothing and better than having 4-year bugs turn into 40-year ones. It's just that this is not a testament to craftsmanship/care/attention to detail.

  • lalitmaganti 8 hours ago

    > this is not care when the priority system is so broken that it requires a full suspension

    I'm not sure I understand this line. The whole point of the fixit is to address the bugs which are considered "low priority" because they only appear in an edge case or are not quite 100% perfectly polished, but which still matter over the long tail of people using the product.

    Or do you propose that every issue like this needs to be fixed before doing anything else?

entropie 20 hours ago

I wanted to take a look at some of these bug fixes, and one of the linked ones [1] seems more like a feature to me. So maybe it should be the week of "low priority" issues, or something like that.

I don't mean to sound negative, I think it's a great idea. I do something like this at home from time to time. Just spend a day repairing and fixing things. Everything that has accumulated.

1: https://github.com/google/perfetto/issues/154

  • mulquin 20 hours ago

    To be fair, the blog post does not explicitly say anywhere that the week was for bug fixes only.

tait1 19 hours ago

We’ve done little mini competitions like this at my company, and it’s always great for morale. Celebrating tiny wins in a light, semi-competitive way goes a long way for elevating camaraderie. Love it!

invalidusernam3 11 hours ago

I like the idea of this, but why not just have some time per week/sprint for bugs? At my company we prioritise features, but we also take some bug tickets every sprint (sometimes loads of bug tickets if there aren't many new features ready for dev), and generally one engineer is on "prod support" which means tackling bugs as they get reported

  • stuartjohnson12 11 hours ago

    Because marginal work is only marginally rewarded. Spending one week and coming back to whoever with a nice piece of paper saying we fixed 60 bugs will earn a lot more rope from non-technical folk than fixing 3 bugs per week - the latter just looks like cleaning up your incompetence.

    • LocalPCGuy 3 hours ago

      I suspect I'm preaching to the choir, but that is a communication issue and a sign the "rewards system" is out of whack, not a "reason" not to push for regular maintenance/tech debt/bug cleanup work.

      It should be understood that there WILL be bugs, that is NOT a sign of incompetence, and so cleaning them up should be an ongoing task so they do not linger and collect (and potentially get worse by compounding with other bugs).

alkonaut 13 hours ago

We once did this for a massive product with 3 releases per year: took a whole cycle to do zero features and just fix bugs. Internal customers, who usually fell over themselves to get their latest feature into the program, accepted it. But we had to announce it early. Otherwise the usual consensus is that customers would rather take 1 feature together with 10 new bugs than -5 bugs and no new features.

PeterStuer 16 hours ago

Confused about the meaning of "bug" used in this article. It seems to be more about feature requests, nice-to-haves, and polish rather than actual errors in edge cases.

Also explains the casual mention of "estimation" on fixes. A real bug fix is even harder to estimate than already brittle feature estimates.

radiator 11 hours ago

It is good to fix bugs, but in my team we need neither the “points system” for bugs nor the leaderboard showing how many points people have. We are against quantifying.

molly0 8 hours ago

We had a quarter, where each Monday we spent most of the day fixing bugs. It greatly improved the product.

franciscator 5 hours ago

Well done; otherwise technical debt would have stopped you.

neilv 19 hours ago

> We also have a “points system” for bugs and a leaderboard showing how many points people have. [...] It’s a simple structure, but it works surprisingly well.

What good and bad experiences have people had with software development metrics leaderboards?
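
For reference, the mechanism described in the post is simple enough to sketch; the size-to-points mapping below is an assumption for illustration, not the article's actual values:

    from collections import Counter

    POINTS = {"small": 1, "medium": 2, "large": 4}

    # (engineer, size of fixed bug) pairs accumulated during the week
    fixed = [("alice", "small"), ("bob", "large"), ("alice", "medium")]

    board = Counter()
    for person, size in fixed:
        board[person] += POINTS[size]

    for person, points in board.most_common():
        print(person, points)  # bob 4, then alice 3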

  • watty 11 hours ago

    I've never had a good experience with individual metrics leaderboards. On one team we had a JIRA story point tracker shown on a tv by a clueless exec. Devs did everything they could to game the system and tasks that required uncertainty (hard tasks) went undone. I believe it contributed to the cog culture that caused an exodus of developers.

    However, I love the idea of an occasional team based leaderboard for an event. I've held bug and security hackathons with teams of 3-5 and have had no problem with them.

Cedricgc 19 hours ago

One nice thing if you work on the B2B software side: end of year is generally slow in terms of new deals. Definitely a good idea to schedule bug bashes, refactors, and general tech debt payments then, with greater buy-in from the business.

brightball 10 hours ago

Getting ready to do a December “Bug Smash” based on the model in the book Shape Up. Whole team has been eagerly awaiting it for months.

  • briandrum 7 hours ago

    Shape Up was my first thought too. I just left a team where I introduced cycles of six weeks of feature development and two weeks of bug fixing, tech debt, and anything else the developers decided to tackle.

    It depends on the stage and size of your team and company of course, but for us the result was more predictable delivery and happier, more-engaged developers.

    For anyone curious to learn more: https://basecamp.com/shapeup/2.2-chapter-08#cool-down

Ethan312 19 hours ago

Focused bug-fixing weeks like this really help improve product quality and team morale. It’s impressive to see the impact when everyone pitches in on these smaller but important issues that often get overlooked.

q2dg 5 hours ago

Systemd should do this too

siliconc0w 18 hours ago

I'm a bit torn on Fix-it weeks. They are nice but many bugs simply aren't worth fixing. Generally, if they were worth fixing - they would have been fixed.

I do appreciate though that certain people, often very good detail oriented engineers, find large backlogs incredibly frustrating so I support fix-it weeks even if there isn't clear business ROI.

  • forgotoldacc 18 hours ago

    > Generally, if they were worth fixing - they would have been fixed.

    ???

    Basically any major software product accumulates a few issues over time. There's always a "we can fix that later" mindset and it all piles up. MacOS and Windows are both buggy messes. I think I speak for the vast majority of people when I say that I'd prefer they have a fix-it year and just get rid of all the issues instead of trying to rush new features out the door.

    Maybe rushing out features is good for more money now, but someday there'll be a straw that breaks the camel's back and they'll need to devote a lot of time to fix things or their products will be so bad that people will move to other options.

    • foxygen 17 hours ago

      Oh boy, I’d trade one (or easily two or three) major macOS versions for a year's worth of bug fixes in a heartbeat.

      • Barbing 17 hours ago

        You got it per Gurman:

        >For iOS 27 and next year’s other major operating system updates — including macOS 27 — the company is focused on improving the software’s quality and underlying performance.

        -via Bloomberg today

        • baq 16 hours ago

          I’ll believe it when I see it, but holy quality Batman I want to believe.

        • Lionga 16 hours ago

          How will the poor engineers get promotions if they cannot write "Launched feature X" (broken, half-baked) on their promotion requests? Nobody ever got promoted for fixing bugs or keeping software usable.

  • saghm 18 hours ago

    A greedy algorithm (in the academic sense, although I suppose also in the colloquial sense) isn't the optimal solution to every problem. Sometimes doing the next most valuable thing at a given step can still lead you down a path where you're stuck at a local optimum, and the only way to get somewhere better is to do something that might not be the most valuable thing measured at the current moment only; fixing bugs is the exact type of thing that sometimes has a low initial return but can pay dividends down the line.

  • baq 16 hours ago

    ROI is in reduced backlog, reduced duplicate reports, and most importantly mitigation of the risk of a phase transition between good enough and crap. This transition is not linear; it's a step function that triggers when the number of individually small, at-worst-mildly-annoying issues is big enough to make the experience of using the whole product frustrating. I'm sure you can think of very popular examples of such software.

nottorp 10 hours ago

So normally they don't fix bugs before adding feature bloat?

boxed 16 hours ago

I did this throughout my entire employment at one company I worked for. Or rather, I should say I made it a point to ignore the roadmap and do what was right for the company by optimizing for value for customers and the team.

Fixit weeks are a band-aid, and we tried them too. The real fix is being a good boss and trusting your coworkers to do their jobs.

OhMeadhbh 19 hours ago

How did you not get fired?

kangs 19 hours ago

hello b/Googler :)

riwsky 18 hours ago

So much of tech debt work scheduling feels like a coordination or cover problem. We're overdue for a federal "Tech Debt Week" holiday once a year, to save people all the hand-wringing over how, when, or how much. If big tech brands can keep affording to celebrate April Fools' jokes, they can afford to celebrate this.

fsniper 13 hours ago

I feel odd about "bug fixing" being a special occasion rather than just being the work. Features need to be added, and bugs need to be fixed. Making it a special occasion makes it feel like some very low priority "grunt work" that requires a hard push to be looked at.

j45 18 hours ago

Fixing bugs before writing new code can shed interesting light on how a dev team can become more effective.

hermitcrab 6 hours ago

Alternative title: "We only fix bugs for one week every quarter".

ls-a 20 hours ago

189 bugs in one week. How many employees quit after that?

  • asdfman123 20 hours ago

    They said they only pick bugs that take at most 2 days to fix.

    Places where you can move fast and actually do things are far better places to work. I mean the ones where you can show up, do 5 hours of really good work, and then slack off/leave a little early.

    • kykat 19 hours ago

      Too bad many places care more about how long you stay warming the seat than how useful the work done actually is.

    • ls-a 18 hours ago

      Nothing takes 2 days to fix. Those are definitely not bugs, like someone else mentioned

      • toast0 18 hours ago

        You haven't seen the same kind of bugs I have, I guess.

        This kind of thing takes more than 2 days to fix, unless you're really good.

        https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=217637

        Or this one

        https://security.stackexchange.com/questions/104845/dhe-rsa-...

        I can find more of these that I've run into if I look. I've had tricky bugs in my team's code too, but those don't result in public artifacts, and I'm responsible for all the code that runs on my server, regardless of who wrote it... And I also can't crash client code, regardless of who wrote it, even if my code just follows the RFC.

        • ls-a 17 hours ago

          That's what I'm saying. Nothing takes 2 days to fix, meaning it takes more time.

          • toast0 16 hours ago

            Oh. Well, I've done easy fixes too. There's plenty of things that just need a couple minutes, like a copy error somewhere.

            Or just an hour or two. I can't find it anymore, but I've run into libraries where simple things with months didn't work, because like May only has three letters or July and June both start with Ju. That can turn into a big deal, but often it's easy, once someone notices it.
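
            A hypothetical reconstruction of that month-prefix class of bug, just to show how small these can be:

                MONTHS = ["January", "February", "March", "April", "May", "June",
                          "July", "August", "September", "October", "November", "December"]

                def parse_month(abbrev):
                    # Buggy: matches on a 2-letter prefix, which is ambiguous for
                    # June/July ("Ju") and March/May ("Ma").
                    for number, name in enumerate(MONTHS, start=1):
                        if name.startswith(abbrev[:2]):
                            return number
                    raise ValueError(abbrev)

                print(parse_month("Jun"))  # 6 -- looks fine
                print(parse_month("Jul"))  # also 6 -- July silently becomes June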

            • ls-a 11 hours ago

              If your goal is to fill green github squares then fine

      • asdfman123 6 hours ago

        I'm sure it has a lot to do with the complexity of the environment but I've fixed three bugs in a day easily.

        Our software isn't serving millions of people though, it's a cli tool with a few hundred end users.

heyitsdaad 17 hours ago

False sense of accomplishment.

Doing what you want to do instead of what you should be doing (hint: you should be busy making money).

Inability to triage and live with imperfections.

Not prioritizing business and democratizing decision making.

  • ocimbote 17 hours ago

    You criticize the initiative because you judge that it doesn't have impact on the product or business. I would challenge that assumption with the claim that a sense of accomplishment, of decision-making, and of completion are strong retention and productivity enhancers. Therefore, they're absolutely, albeit indirectly, impacting product and business.

  • snovv_crash 15 hours ago

    Just because you can't measure the loss of customers who are turned off by your buggy product doesn't mean they don't exist.