Sunday, November 21, 2004

Hidden dependencies

If you've been following my entries over the past week, you'll know I've just finished a release. Burned to CD, nicely labelled and dispatched by horny-handed messenger to the corners of the earth.

A couple of days later I installed the release on a clean machine (as I'd done each morning during the release process), but this time I exercised a dialog box I hadn't exercised during release testing. And dammit, it didn't even appear. Hmmmm. So I go to my development machine, running the same code, and the dialog box appears. Just to be sure, I installed the release on another clean machine: still no dialog box. Actually, two dialog boxes failed to appear, while the others did. The two that failed both use the ListView control; the ones that appeared use only standard pre-Win95 controls (edit boxes, buttons, etc.).

OK, let's be methodical about this. It works on my development box but not on a clean box. I decided to install the development environment on the clean machine and single-step through the code to work out where it was going wrong. So an hour later I've installed VC6, copied the source code from SourceSafe, and I'm ready to test. But just to be sure, I run the app again outside the development environment, and this time the dialog box appears! WTF?

Hopefully it's obvious that the mere act of installing VC6 (without even running it) fixed the problem. So how does installing VC6 change the system? Well, for one thing, it installs and registers a bucketload of OCXs. And, to cut a long story short, that was the problem. This MFC app uses the VB wrapper OCX around COMCTL32.DLL. I ran the app on my development machine under the debugger, waited until all the 'loading *.dll' messages died down, and then opened the dialog box. Up pops MSCOMCTL.OCX in the list. Aha!! So I copy that OCX from my development machine to the clean machine (the second one, without the development environment) and register it. Voilà!! The dialog opens just fine. Unregister the OCX and, of course, the dialog fails.
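The diagnosis above (let the module-load messages settle, open the dialog, see which new module turns up) boils down to a set difference, so it can be automated. What follows is a minimal sketch under stated assumptions, not a description of any existing tool: it assumes you've saved the debugger's output as text both before and after opening the dialog, that each load is reported on a line of the form `Loaded 'C:\path\NAME.DLL', ...` (adjust the regular expression to whatever your debugger prints), and that you maintain two lists yourself: the files your installer ships and the modules a clean OS install already provides. The names `loaded_modules` and `missing_from_install` are hypothetical.

```python
from __future__ import annotations

import re

# ASSUMPTION: the debugger reports each module load on a line such as
#   Loaded 'C:\WINNT\system32\MSCOMCTL.OCX', no matching symbolic information found.
# Adjust this pattern to whatever your debugger actually prints.
LOADED = re.compile(r"Loaded '([^']+)'", re.IGNORECASE)


def loaded_modules(debug_output: str) -> set[str]:
    """Return the lower-cased file names of every module reported as loaded."""
    names = set()
    for line in debug_output.splitlines():
        match = LOADED.search(line)
        if match:
            # Keep only the file name; install directories differ between machines.
            path = match.group(1).replace("\\", "/")
            names.add(path.rsplit("/", 1)[-1].lower())
    return names


def missing_from_install(before: str, after: str,
                         shipped: set[str], clean_os: set[str]) -> set[str]:
    """Modules loaded only after the dialog opened that neither the installer
    (shipped) nor a clean OS install (clean_os) provides. Both lists are
    hypothetical inputs you maintain yourself."""
    new_modules = loaded_modules(after) - loaded_modules(before)
    provided = {name.lower() for name in shipped | clean_os}
    return new_modules - provided


if __name__ == "__main__":
    before = r"Loaded 'C:\WINNT\system32\ntdll.dll', no matching symbolic information found."
    after = before + "\n" + "\n".join([
        r"Loaded 'C:\WINNT\system32\COMCTL32.DLL', no matching symbolic information found.",
        r"Loaded 'C:\WINNT\system32\MSCOMCTL.OCX', no matching symbolic information found.",
    ])
    print(missing_from_install(before, after,
                               shipped={"MyApp.exe"},
                               clean_os={"ntdll.dll", "COMCTL32.DLL"}))
    # prints {'mscomctl.ocx'}
```

The check only finds dependencies on code paths you actually exercise while capturing the output, so a dialog nobody opens will never show up in the diff. And a flagged OCX has to be registered on the target machine (e.g. with regsvr32), not merely copied there.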

How did this situation arise in the first place? I inherited this code from another developer, who told me he'd started the app in VB6 but couldn't get the required performance, so he migrated it to MFC. Along the way he'd kept a control he was familiar and comfortable with rather than learning how to use ListView directly. Well, OK, that explains how this MFC app ended up using a VB wrapper OCX. But how come it's taken until now (around two years) for the problem to surface?

My predecessor's method of installing the software was to install the development environment on each new machine (I should note that we have a very restricted audience for this software: it's part of a half-million-dollar production machine). Then he'd copy the latest source code from his laptop and compile it in place. On average, all this took him at least a day.

When I joined the company I was aghast at this. I set myself the goal of producing installation CDs and bringing up a new machine in less than 30 minutes. Achieved it too, at least if you're upgrading an existing machine. Ah well, next month's release will contain the OCX, so it'll work properly on virgin hardware too.

And now for the mea culpa. It's obvious that my release testing isn't thorough enough. From now on, release CD tests will include exercising every dialog box. But methinks I need to test much more than I've been testing. Much thought is required.
