Have you tested this GCs performance? Sometimes a baby GC can be fast enough.
I haven't measured the performance. I would like to. I'm especially interested in comparing it with the current version of the collector. It is now capable of heap compaction which will be the subject of a future article. I'm also interested in knowing how big a problem the conservative part of the collector is. The C stack depth has been minimized significantly since I got rid of the recursive evaluator. I don't think it's a significant problem but I could be wrong.
I need to implement the compiler's profiling functions in lone. Documentation on the matter seems scarce. Not even sure if those instrumentation APIs are stable. I got away with not implementing the ASAN and stack protector since they added flags that make the compiler emit trapping instructions instead of function calls. Profiling is a lot more complex.
Re forcing a register spill: assuming the GC is invoked via an ABI-compliant function call, you don’t actually need to save all the scalar registers manually, only the callee-save ones (as setjmp does). Alternatively, you can make the compiler do the spilling by inserting a function preserving no registers into the call chain—this is spelled __attribute__((preserve_none)) in sufficiently new Clang or GCC, but you can also accomplish this kind of thing on Watcom with #pragma aux, for example.
Re obtaining the top and bottom of the stack: in addition to __builtin_frame_address, there’s also the age-old option of declaring a local and looking at its address, as long as strict aliasing doesn’t bite you. And if you know you’re running on Linux or something sufficiently compatible (as seen from the reference to auxv), you can treat argv as the bottom of the stack for the main thread, as the startup stack on Linux is (IIRC) argv, then the argument strings, then the environment, then the auxiliary vector.