Lately I have become quite interested in garbage collection algorithms Since garbage collection massively affects code performance, there have been quite elaborate garbage collection algorithms devised in the JVM. What initially sparked my interest was the mmtk toolkit and the LXR paper.
I decided to dive deeper into the topic and consulted the Garbage Collection Handbook 1, which gives a very nice and structured overview of the basic algorithms and their performance characteristics.
I also went through a few papers related to GC two notable ones: - A unified Theory of Garbage Collection 2, which investigated the “dualism” between garbage collection and reference counting - On-the-Fly Garbage Collection: An Exercise in Cooperation 3, which is a short but succinct paper on concurrent garbage collector (related to Golang’s GC implementation)
One of my main goals was to implement a real garbage collector in a real interpreter. What I found out is that the PyPy interpreter has been adapted to support plugable garbage collectors, hence I decided to implement a simple garbage collector in PyPy.
Choose the simplest garbage collection algorithm first, so you can familiarize yourself with the code base. From my POV the most basic algorithm is the mark-and-sweep garbage collection.
The algorithm is quite simple: 1. Mark: Traverse the whole object graph (tracing) from all root objects and mark every reachable object 2. Sweep: Examine every heap object and free object if not marked as reachable
So here’s condensed version of the implementation steps and issues I
found along the way. Since the PyPy test suite was quite helpful I did
some kind of TDD for the implementation. There is already a few GC
implementations in the interpreter and the SemiSpaceGC
seems to be the simplest one. This gave me a good starting point for the
mark-and-sweep.
Starting from what I have seen in SemiSpaceGC
I created
a class that derives from the base classes and added a test
Unfortunately the basic
setup produced an
AttributeError: 'SimpleMarkSweepGC' object has no attribute 'HDR'
when executing the tests. I guess the interface in the
GCBase
could be improved to provide some more guidance on
what the interface needs. We will run into some more of that later
on.
The solution was to do what all the other gc implementations are also
doing, so I looked the semispace gc to find out what to do: Each GC
class defines a header struct to save info on garbage collection,
HDR = lltype.Struct('header', ('tid', lltype.Signed))
After
a few more AttributeErrors I ended up with this
minimal setup
From inspecting the rest of the gc implementations it seemed that
malloc_varsize
/ malloc_fixedsize
create new
objects that are managed by the GC, the collect
methods
runs a major collection of the GC and
init_gc_object_immortal
usually sets a flag for immortal
objects (called
from consider_constant
).
Having the basic structure in place, I could implement the first
version of the GC (I guess you could call it a NoGC similar to
mmtk) that allocates memory via the raw_malloc
function. Using raw_malloc
is in contrast to what the
semispace GC does since most of the GC implementations in PyPy are using
arena based allocation. The
simple implementation already satisfies 20 / 45 tests. After the
basic setup we can finally look into implementing the mark and sweep
phases.
We need to track all the objects on the heap and introduce an
AddressDeque
which holds the references for that. (see also
HybridGC
for a similar implementation) Every time we
allocate we can add the address to our deque and in the sweep phase we
can iterate through all the references.
We use DFS to mark surviving objects from the roots. Note: It is quite hard to follow where you are dealing with addresses vs pointers vs objects in rpython… This is the resulting implementation
Most tests that are failing are related to finalizers / weakrefs / collections.
Weak references are references that don’t affect the “reachability” of an object. The weak reference can either be valid (pointed-to object is alive) or invalid (pointed-to object has been collected).
Similar to other implementations We maintain list of objects that contain weakrefs. then for each object that contains a weakref we check if the object it is pointing to survives. If it does not survive, we invalidate the weakref (setting to NULL), otherwise we “keep” it.
A finalizer is a function that runs when an object is about to be
collected. In PyPy there’s essentially two types of finalizers: - The
__del__
method (set via
malloc_fixedsize(..., needs_finalizer=True)
- FinalizerQueues
which have to be explicitly instantiated
Dealing with finalizers is not straight-forward since finalizers can revice objects and you need to be careful with the execution order.
As in the other GCs we will keep track of both types of finalizers in
the same queue objects_with_finalizers
. Each entry in the
queue is essentially a pair (object, finalizer_queue_index)
del method gets index -1, finalier queues get a proper
index.
We need to sweep before executing the finalizers, otherwise a finalizer might introduce new objects that are not marked as surviving by default The resulting code is part of this commit.
The remaining test cases were all related to starting a collection while another collection is already going on (e.g. via the infamous finalizers…)
=========================== short test summary info ============================
FAIL rpython/memory/test/test_marksweep.py::TestSimpleMarkSweep::()::test_finalizer_calls_malloc
FAIL rpython/memory/test/test_marksweep.py::TestSimpleMarkSweep::()::test_finalizer_calls_malloc_during_minor_collect
FAIL rpython/memory/test/test_marksweep.py::TestSimpleMarkSweep::()::test_collect_during_collect
To prohibit “nested” garbage collection, we introduce a lock (in this case a simple variable since the implementation is not concurrent) so that no collection can happen while another collection is taking place.
At the end of the day the Mark & Sweep algorithm is not a complex algorithm, but implementing it in a real codebase was more challenging than expected since I had to dive deep into the inner workings of the PyPy interpreter and RPython. Thankfully the PyPy repsoitory contains a great test-suite to validate the implementation.
Finalizers make your life harder (at least with regards to GC) since
finalizers can e.g. revive objects, so extra steps need to be taken
before freeing memory. There’s no finalizers in Golang or Java (at least
since Jav SE9, for a good explanation why that’s the case read the
deprecation notice for the Object.finalize
method)
It was quite fascinating to dive deeper into the implementation of PyPy and to look into the usage of RPython as a low-level implementation language. I’m quite interested in building a small interpreter for some toy language to play around with RPython it a little more.
My next GC project will probably be a compacting collector.
Jones, Richard, Antony Hosking, and Eliot Moss. The garbage collection handbook: the art of automatic memory management. CRC Press, 2023.↩︎
Bacon, David F., Perry Cheng, and V. T. Rajan. “A unified theory of garbage collection.” Proceedings of the 19th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications. 2004.↩︎
Dijkstra, Edsger W., et al. “On-the-fly garbage collection: An exercise in cooperation.” Communications of the ACM 21.11 (1978): 966-975.↩︎