Fast threads

Recently I was sitting in some meeting or other and thinking deep thoughts about threads. (Either that or falling asleep.) I thought about OpenMP and the short-lived job assignments that get handed out to threads. Then I thought about all the megabytes of memory allocated to the stacks for all those threads, most of which were only ever going to hold one stack frame.

I wondered if it would make sense for a chip architecture to provide a way to create a lightweight register context (that is, a thread) without any specific stack associated with it. It’s not really so much a question of the CPU instructions to support it; that part’s pretty simple. The bigger question is the OS support.

For SPARC chips, the OS needs to have a stack as the place to spill register windows to, and so the SPARC ABI requires that one of the registers in a register window always points to the spill location for the registers in that window (in other words, the stack pointer).

But the compiler can identify OpenMP loop bodies that don’t make any subroutine calls (at least after inlining and intrinsic expansion). For such a loop body, there aren’t going to be any register windows that need spilling, except the one window the body runs in, and that one only needs spilling if the thread gets swapped out.
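To make that concrete, here is a minimal sketch of the kind of loop body I have in mind. The function name `scale_add` and its parameters are made up for illustration; the point is only that the body is pure arithmetic with no calls, so a worker running it would never need a second stack frame.

```c
/* Hypothetical example: y[i] = a * x[i] + y[i] over n elements.
 * The loop body makes no subroutine calls, so each worker thread
 * stays in a single frame for the whole chunk of iterations. */
void scale_add(double *y, const double *x, double a, long n)
{
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

Built without OpenMP enabled, the pragma is ignored and the loop simply runs serially, which gives the same result.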

If the chip and the OS had special support for such lightweight threads, then it seems like you’d be able to create and tear down gangs of OpenMP threads very quickly. The catch is that there are OpenMP intrinsics and library calls that do things like read the current thread_id. Obviously you’d have to disable this kind of light-threads optimization if the code tried to perform any libthread-style operations on the threads inside the OpenMP body.
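By contrast, here is a sketch of a loop body that would probably have to fall back to ordinary threads, because it asks the runtime who it is. `omp_get_thread_num` is the standard OpenMP library call; the function `tag_with_thread` and the serial fallback stub are my own assumptions, added only so the example builds with or without OpenMP.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Assumed fallback so the sketch still builds without OpenMP. */
static int omp_get_thread_num(void) { return 0; }
#endif

/* Hypothetical example: record which thread handled each element.
 * The runtime call in the body means the thread needs a real
 * identity (and a real stack for the call), so the light-threads
 * trick would presumably have to be switched off here. */
void tag_with_thread(int *owner, long n)
{
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        owner[i] = omp_get_thread_num();
    }
}
```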

Oh well, it was just a thought that occurred to me. I’m sure it’s been done before someplace or other. But it only seems like it would pay off for the kind of applications that I call “numeric” applications (in other words, applications that are amenable to HPC / HPTC computing environments).