Heisenbug, noun
Oct. 29th, 2013 12:55 pm
"A computer error that goes away every time you look at it closely, making it difficult to diagnose."
As, for example, spending the evening trying to figure out why your Internet connection has become flaky -- only to realize in the morning that the cablemodem is plugged into the light switch. D'oh...
(no subject)
Date: 2013-10-29 05:17 pm (UTC)
The resulting bug was a nightmare to find.
(no subject)
Date: 2013-10-29 06:33 pm (UTC)
I lost over a *month* to that goddamn bug, and never did get Microsoft to admit that it was their fault, but my online research basically came to the conclusion that I'd hit a bug deep inside the OS, specifically in the sound libraries.
I don't remember the full details, but when you started using this particular media library, it would begin a timer in the kernel that would occasionally wake up, scan an internal linked list, pop the first thing from the list and add a new one to the end. Problem was, if you killed the process without properly closing that library (because the program crashed or -- as happened all the time for us -- if you stopped it in the debugger), it would stop *removing* things from the list, but would keep *adding* to it. This only happened a relatively few times per second, but eventually added up to scanning a linear linked list that was *millions* of elements long, several times a second. And of course, since it was in the kernel, the OS didn't think anything at all was happening, but it would begin to miss OS events like mouse movement and clock ticks since it was spending all its time scanning the list.
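For anyone who wants to see the shape of that failure, here is a toy simulation of the behavior as described. It is a sketch only: the names (`simulate`, `client_alive_until`) are made up for illustration, and nothing here reflects the actual Windows internals, which the comment itself says are half-remembered.

```python
from collections import deque


def simulate(ticks, client_alive_until):
    """Return the list length after each timer tick.

    While the (hypothetical) client process is alive, every tick pops one
    entry from the front and appends one at the back, so the length stays
    constant. Once the client has been killed without closing the library,
    the pops stop but the appends keep going.
    """
    entries = deque([0])
    lengths = []
    for tick in range(1, ticks + 1):
        if tick <= client_alive_until and entries:
            entries.popleft()      # normal housekeeping: remove the oldest entry
        entries.append(tick)       # always add a new entry at the end
        lengths.append(len(entries))
    return lengths


lengths = simulate(ticks=10, client_alive_until=4)
print(lengths)        # length is flat while the client lives, then grows by one per tick
print(sum(lengths))   # total elements visited if every tick scans the whole list
```

The second number is the painful one: if each tick walks the entire list, the total work grows roughly with the square of the number of ticks since the process died.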
It was an "anti-Heisenbug" in that, as I eventually figured out, it *only* happened if you were debugging. Fixing it in the shipping product was trivial: I just made sure that we had an outer exception catch, and always shut down that damned Windows library properly. But even if we hadn't fixed it, I suspect it would never have been noticed in the field. It was aggravating to realize that the bug only really showed up when debugging, and that the only solution was, really and truly, to just reboot eff'ing Windows every now and then.
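The fix amounts to the usual "outermost catch plus guaranteed cleanup" pattern. A minimal sketch, assuming a stand-in `MediaLibrary` class with `start` and `shutdown` methods (both hypothetical, not the real Windows API, and the original was certainly not written in Python):

```python
class MediaLibrary:
    """Hypothetical stand-in for the sound library; not a real API."""

    def __init__(self):
        self.is_open = False

    def start(self):
        self.is_open = True

    def shutdown(self):
        self.is_open = False


def run_app(lib, work):
    """Run `work` with the library open, and always close the library."""
    lib.start()
    try:
        work()
    except Exception as exc:
        # Outermost catch: report the crash instead of dying mid-flight.
        return f"crashed: {exc}"
    finally:
        # Runs on normal exit and on a caught exception alike.
        lib.shutdown()
    return "ok"


def boom():
    raise RuntimeError("simulated crash")


lib = MediaLibrary()
print(run_app(lib, boom))   # crashed: simulated crash
print(lib.is_open)          # False
```

Note what this does not cover: if the process is killed outright, as when you stop it in the debugger, no cleanup code runs at all. That matches the observation above that the bug only really showed up while debugging.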
(Of course, this was around the time that we found out that it was physically impossible to run Windows for more than two months before an internal timer would roll over and crash the OS. Far as we could tell, nobody had ever hit this because it was nearly impossible to successfully run Windows for that long...)
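The comment doesn't say which timer was involved, so treat the following as back-of-the-envelope arithmetic under an assumption: if the counter is an unsigned 32-bit count of milliseconds since boot, it wraps after a bit under 50 days, which is in the same ballpark as the figure remembered above. The function name is made up for illustration.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # milliseconds in one day


def days_until_wrap(bits, ticks_per_day=MS_PER_DAY):
    """Days before an unsigned counter of the given width rolls over."""
    return (2 ** bits) / ticks_per_day


print(round(days_until_wrap(32), 1))   # 49.7

# What the rollover itself looks like for a 32-bit counter:
last_tick = 0xFFFFFFFF
print((last_tick + 1) & 0xFFFFFFFF)    # 0
```

Any code that assumes "later tick minus earlier tick is positive" gets a nasty surprise at the moment the counter goes back to zero.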
(no subject)
Date: 2013-10-29 06:36 pm (UTC)
(no subject)
Date: 2013-10-29 06:51 pm (UTC)
As for Apple, it was the usual story: Apple's "my way or the highway" attitude turned a lot of folks off, and the variety of hardware and software available for it was a small fraction of what you could do with Windows. It was basically a comfortable high-end niche player.
So basically, there was no competition to speak of -- indeed, that "sheer force of inertia" was much, much stronger at that point. (*Now*, Windows is in serious trouble, because iOS and Android have badly disrupted the assumptions underlying it.) Everyone knew that Windows was crappy (certainly all the programmers did), but it was what the consumers had, so we targeted it. And since all the programs ran on Windows, all the consumers bought it. It was very sweet for Microsoft, for quite a while...
(no subject)
Date: 2013-10-30 12:14 am (UTC)
At work we run into a Windows "bug" (they call it a feature) that means that network access is 3x - 20x slower on Windows than it needs to be, as evinced by the speed on the same box running Linux. Normal people don't find it painful enough to jump ship, however.
Admittedly this is much better under Windows 7 than it was under XP, where it was 100x-300x slower.
(no subject)
Date: 2013-10-30 09:40 pm (UTC)
Of course, the big difference was that a lot of people knew about this problem, because otherwise you could keep the system up indefinitely.
(no subject)
Date: 2013-10-29 06:44 pm (UTC)
OLD HACKER MODE ON
I was working on a multitasking implementation of DOS, and we found that the system would suddenly crash-and-hang the OS on rare occasions. Our original hint was that the sound was always on, and while we weren't ALWAYS typing, we frequently were typing.
Back in the day when programming to DOS on an IBM PC, it was well understood that the BIOS calls were safely re-entrant. Not only were they, but IBM had actually published the code for the original BIOS to allow users to rely on its behavior.
It turns out that this BIOS was not sufficiently re-entrant - it had allowed itself to be called up to two-deep, but if it was called 3 deep, that was a problem. The developer was likely sure that could not happen. But that product I was using included some very smart boards that slotted into the data bus, and indirectly it could cause its own BIOS re-entrant calls.
Also, as it turns out, if you REALLY tried hard, you could make 3 levels of BIOS calls without using an external board, although why would you?
I wrote a lovely assembly language program that would do just that - make 3 level deep calls into the BIOS. And my Heisenbug became a bug.
Alas: when I called the OS vendor, they sort of couldn't understand what I was talking about. That code had been written by a contractor who was long-gone, and they knew they didn't understand it, and weren't going to fix it. They offered to fly me to their offices for a week to fix it (and would pay me and my employer a bounty to do so), but my employer wouldn't bite.
So, we dumped that OS and went on to another. My 3-deep tool became a standard acceptance test for evaluating OSes.
(no subject)
Date: 2013-10-29 06:56 pm (UTC)
I confess to being morbidly curious what they were doing to allow level 2 re-entrancy but not level 3 -- that smells like somebody hacked something horribly...
(no subject)
Date: 2013-10-29 07:01 pm (UTC)
Remember back in those days (and even now with embedded devices), space cost money and no one used space they didn't have to.
(no subject)
Date: 2013-10-29 07:07 pm (UTC)
Fun times: it was my first experience of "they didn't tell me it was impossible when I got the assignment"...
(no subject)
Date: 2013-10-30 11:50 am (UTC)
(no subject)
Date: 2013-10-30 12:44 pm (UTC)
(no subject)
Date: 2013-10-30 02:21 pm (UTC)
Plus 95% of my sweethearts have been techies...
(no subject)
Date: 2013-10-29 09:57 pm (UTC)
(no subject)
Date: 2013-10-29 10:11 pm (UTC)
But even there, I'm still dependent on the system libraries not sabotaging things from the get-go...
(no subject)
Date: 2013-10-30 12:11 am (UTC)
Often heisenbugs are the result of bogon fields interacting poorly with the hardware and/or software, which can be extra hard to diagnose--even taking into account the laws of quantum chromobogodynamics. Glad this one was easier to track down...