We need to get the optimizer more involved with thread creation and synchronization.  Today the optimizer group is very focused on OpenMP programs, and less focused on trying to optimize pthreads-style programs.  (Of course, we have lots of OpenMP customers too, and it’s their judgment call.)

I was reading some slides from a talk about Transactional Parallel Programming by Kunle Olukotun, and (as often happens) my mind started wandering. I recalled a description of a recent optimization in the JVM related to synchronization performance. (Some cool synchronization optimizations in the Mustang JVM are described here.)

The idea I’m thinking about, biased locking, is used in the JVM: assign each object a thread ID when the object is first used by a thread.  Then, as long as that thread is the only accessor of the object, you don’t have to synchronize it.  You only have to make sure that if another thread starts to access the object, you have a (potentially expensive) way to convert it to a “must synchronize for real” object.

This results in a low-level, ubiquitous way to optimize synchronization for the case of non-shared objects.  The theory is that this helps because the majority of objects in the average (well-designed) program are short-lived and localized to a single thread.

It occurred to me that we could do the same for C++ programs, for objects that have built-in synchronization (for example, from the STL).  The key part is that each object access must check the status of the object before the access can go forward, to distinguish “synced” objects from “nonsynced” objects.  Obviously, for small accessors you don’t want to pay this cost.  These checks can be optimized (hoisted upward) as long as the optimizer can prove that the thread cannot change between the upper layer and the lower layer.

For example, A() calls B().  Inside each function is a call to a method on an object (for example, X->Y()).

A() {
   X->Y(1);
   B();
}

B() {
   X->Y(2);
}

X::Y(int i)
{
    if (this_thr != my_thr)  // also check if uninit...
        convert_to_full_sync();
    work();
}
If the only call to B() is through A(), then the compiler can optimize this as:
A() {
   if (this_thr != X->my_thr)
       X->convert_to_full_sync();
   X->Y(1);
   B();
}

B() {
   X->Y(2);
}

X::Y(int i)
{
   work();
}
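
To make the shape of that check concrete, here’s a minimal C++ sketch of what X might look like.  Only the names my_thr, convert_to_full_sync, and work come from the pseudocode above; the std::atomic fields, the full_sync flag, and the fallback mutex are my own guesses.  The revocation step is deliberately simplified, since coordinating with the bias-owning thread is exactly the “potentially expensive” part (the JVM, for instance, uses safepoints for this).

#include <atomic>
#include <mutex>
#include <thread>

class X {
    std::atomic<std::thread::id> my_thr{};  // bias owner; a default id means unowned
    std::atomic<bool> full_sync{false};     // set once the bias has been revoked
    std::mutex m;                           // only touched on the slow path

    void convert_to_full_sync() {
        // Simplified revocation: just flip the flag.  A real implementation
        // must also make sure the former bias owner notices and starts
        // locking, which is where the expense comes in.
        full_sync.store(true, std::memory_order_release);
    }

    void work(int i) { /* the method body proper */ }

public:
    void Y(int i) {
        if (!full_sync.load(std::memory_order_acquire)) {
            const std::thread::id self = std::this_thread::get_id();
            std::thread::id unowned{};      // default id: nobody owns the bias yet
            if (my_thr.load(std::memory_order_relaxed) == self ||
                my_thr.compare_exchange_strong(unowned, self)) {
                work(i);                    // fast path: no locking at all
                return;
            }
            convert_to_full_sync();         // a second thread showed up
        }
        std::lock_guard<std::mutex> guard(m);  // slow path: real synchronization
        work(i);
    }
};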
In order to get the full benefits of optimization, the optimizer needs to treat threads as first-class citizens in the analysis phase.  Any routine that can be the root of a new thread needs to be identified.  This is challenging because any external function can be used by another function as the root of a thread (a small illustration follows).  Optimizations involving synchronization can be applied to the program in those places where a localized group of functions will always be called by the same thread.
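
To illustrate the difficulty (the function names here are invented): a function that looks perfectly ordinary in its own translation unit can be turned into a thread root by any other translation unit that the optimizer never sees.

#include <pthread.h>

// In its own translation unit, helper() looks like an ordinary function,
// so an optimizer might assume it always runs on its caller's thread...
extern "C" void* helper(void*) {
    return nullptr;
}

// ...but any other translation unit can make it the root of a new thread:
void spawn() {
    pthread_t t;
    pthread_create(&t, nullptr, helper, nullptr);  // helper now begins a thread's call tree
    pthread_join(t, nullptr);
}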

I don’t know whether such analysis is done by any of the major native (non-interpreted) optimizers out there, but it would probably need to be tied in with a few directives for things like declaring thread root functions, declaring synchronization primitives, and so on.
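
Purely as a sketch, such directives might look like the following; the pragma names are invented, not any existing compiler’s:

// Hypothetical directives -- no shipping compiler spells them this way.
#pragma thread_root            // the next function may begin a thread's call tree
void* worker(void* arg);

#pragma sync_primitive         // the next function revokes a bias / takes a lock
void convert_to_full_sync();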

P.S. I know that my example doesn’t demonstrate why root functions need to be identified, and that the caller/callee analysis is something that’s already done.  Hopefully I’ll have time to post a better example later.