Never trust your reference count!
published on Saturday, May 14, 2016
One of the cool things about programming is finding bugs. Personally, I prefer bugs that catch me by surprise or display astonishing symptoms. Bugs can be particularly refreshing if your own code is at fault, possibly that one seemingly well-thought-out piece that you'd think to be fine - until reality catches up and teaches you otherwise. The permanent testimony of your own inabilities helps keep you humble - a feeling that I'd like to share with you by telling you about the sweetest cherries I find. This is my first post of this kind, about a bug I encountered this week.
The program
Consider the following simple program that shows a dialog with a single button and quits the application once the button is clicked. You will need python-gobject to run the example. Both Python 2.7 and 3.5 should work fine:
The program appears simple enough - so what's the problem?
If you click the button as soon as the window appears, the program will usually exit as intended - but if you wait a few seconds before clicking, the button no longer does anything. Try it out! (You have to wait about 3s on py2 and about 10s on py3.)
Wait, wat? Why would this happen?
Debugging
It looks like either button_clicked is not executed or the list of callback handlers is cleared. My initial thought was even that the reference count of the callback handler (here mainloop.quit) might somehow have been decreased, leading to its destruction and therefore preventing its execution (the background work was done in a C library which I did not entirely trust).
Let's investigate: Insert a few debugging statements in Dialog, strip do_some_work to its essential functionality and execute it only once. The following modified code will make clear what's going on:
Run the program and immediately click the button. You get, as expected:
Now wait 2s for the timer to fire:
Wow, the dialog is dead! Panicking, you repeatedly click the button:
The dialog has zombified. It can still act, but its former memory, its personality is lost.
Diagnosis: Death
All that's left to do is find someone to blame. How is it possible that the dialog is still visible, but left to die?
Apparently, the PyGI GTK binding does not hold a reference to the Gtk.Dialog object. Nor does it increase the reference count of the button's "clicked" signal handler, even though it does remember to call it. In consequence, the dialog is only kept alive by a cyclic reference between the dialog and the signal handler itself (the handler is a bound method and therefore stores a reference to the dialog object). The garbage collector can detect such a cycle and, since nothing outside the cycle refers to it, delete both objects.
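The same mechanism can be reproduced without GTK. The following is a minimal sketch, assuming CPython; the `Dialog` class here is a plain Python stand-in (not `Gtk.Dialog`), and `make_dialog` and `probe` are names made up for illustration. The object stores its own bound method, so it sits in a reference cycle that plain reference counting cannot free, but that the cycle collector will happily remove once nothing else points at it:

```python
import gc
import weakref


class Dialog(object):
    """Plain Python stand-in for a dialog; NOT the real Gtk.Dialog."""

    def __init__(self):
        # A bound method holds a reference to its instance, so storing it
        # on the instance creates the cycle: self -> handler -> self.
        self.handler = self.on_clicked

    def on_clicked(self):
        return "clicked"


def make_dialog():
    dialog = Dialog()
    # Only a weak reference escapes, mimicking a library that remembers
    # the object without owning a strong reference to it.
    return weakref.ref(dialog)


gc.disable()  # keep the automatic collector from interfering with the demo
probe = make_dialog()

# The reference count is still non-zero because of the cycle.
alive_before = probe() is not None

gc.collect()  # an explicit collection finds the unreachable cycle

alive_after = probe() is not None
gc.enable()

print(alive_before, alive_after)
```

In the real program nobody calls `gc.collect()` by hand; the collector simply runs whenever enough allocations have piled up, which would explain why the dialog only dies after a delay.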
You ask why? Bad design, I'd say. I've seen similar behaviour in other components, such as the libnotify bindings. The result is always that some callback is executed which belongs to a dead Python object.
But why did the dialog come back to life and behave almost like in the old days - instead of e.g. just crashing the program? I guess it's just a coincidence that the corresponding memory was not overwritten yet.
Bugfix
How can you fix the problem in your code? The answer is simple and unsatisfying: Just add a global reference to the dialog object:
And don't forget to call self._INSTANCES.remove(self) when getting rid of the dialog, so that it can be cleaned up.
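Here is a GTK-free sketch of this pattern, assuming CPython. Only the `_INSTANCES` name comes from the fix above; the `Dialog` stand-in class, its `close` method and the `probe` variable are hypothetical and exist only to make the effect observable:

```python
import gc
import weakref


class Dialog(object):
    """Plain Python stand-in for a dialog; NOT the real Gtk.Dialog."""

    # Class-level list holding strong references to all live dialogs.
    _INSTANCES = []

    def __init__(self):
        self.handler = self.on_clicked   # same cycle as in the buggy case
        self._INSTANCES.append(self)     # reference from outside the cycle

    def on_clicked(self):
        return "clicked"

    def close(self):
        # Hypothetical cleanup hook: drop the global reference so the
        # garbage collector is allowed to reclaim the dialog again.
        self._INSTANCES.remove(self)


dialog = Dialog()
probe = weakref.ref(dialog)
del dialog  # the caller forgets about the dialog

gc.collect()
kept_alive = probe() is not None  # still reachable via Dialog._INSTANCES

probe().close()
gc.collect()
released = probe() is None  # nothing but the cycle is left, so it is freed

print(kept_alive, released)
```

The list turns the dialog's lifetime into an explicit decision: it lives until `close` is called, no matter what the binding does or does not reference.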
The lesson
- Never trust third-party libraries to increase the reference count of something, just because they use it later on.
- Even familiar bugs can appear in unknown varieties, keeping them fresh and fun.
- Don't get lulled into a false sense of security that dead things stay dead!