Nick,
Thanks for posting this. You mentioned that the barrier (e-lib/eSDK?) dominates the run time. That barrier is a slow, linear-time algorithm; I've demonstrated a logarithmic-time barrier that is about 9x faster across all cores.
https://arxiv.org/pdf/1608.03545v1.pdf (section 3.6)
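To illustrate the idea (this is a hedged sketch of a dissemination-style schedule, not the actual implementation from the paper): in round k, core i signals core (i + 2^k) mod N, so N cores synchronize in ceil(log2(N)) rounds instead of N-1 linear steps. The `dissemination_schedule` helper below is my own illustration, not an eSDK function.

```python
def dissemination_schedule(n):
    """Signal schedule for an n-core dissemination-style barrier.

    In round k, every core i signals core (i + 2**k) % n. After
    ceil(log2(n)) rounds every core has (transitively) heard from
    every other core, so all cores may proceed past the barrier.
    """
    rounds = []
    k = 0
    while (1 << k) < n:
        rounds.append([(i, (i + (1 << k)) % n) for i in range(n)])
        k += 1
    return rounds

# 16 cores need only 4 rounds, versus 15 sequential steps for a
# linear barrier -- which is where the speedup comes from.
print(len(dissemination_schedule(16)))  # -> 4
```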
However, one should avoid collective barriers entirely when only two cores actually need to synchronize in a message-passing paradigm.
Also, "one to all" broadcast operations can be parallelized if every core participates in the collective operation. The broadcast then proceeds in stages. For example, with 16 cores and core 0 sending to all others:
stage 0: 0 -> 8
stage 1: 0 -> 4, 8 -> 12
stage 2: 0 -> 2, 4 -> 6, 8 -> 10, 12 -> 14
stage 3: 0 -> 1, 2 -> 3, 4 -> 5, 6 -> 7, 8 -> 9, 10 -> 11, 12 -> 13, 14 -> 15
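The stage schedule above can be generated mechanically. A minimal sketch, assuming a power-of-two core count and root core 0 (`broadcast_stages` is an illustrative helper of mine, not an eSDK or OpenSHMEM API):

```python
def broadcast_stages(n):
    """Binomial-tree ("farthest first") broadcast schedule for n = 2**k
    cores rooted at core 0. In each stage, every core that already
    holds the data forwards it to the core `step` away; the step is
    halved each stage, so all n cores have the data after log2(n)
    stages."""
    stages = []
    step = n // 2
    while step >= 1:
        # Cores at multiples of 2*step already hold the data.
        stages.append([(i, i + step) for i in range(0, n, 2 * step)])
        step //= 2
    return stages

print(broadcast_stages(16)[0])  # -> [(0, 8)]: farthest core first
```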
Sending data "farthest first" avoids some of the communication issues you discussed. For small broadcasts, latency is overwhelmingly dominated by control flow rather than network bottlenecks. Admittedly, my software design was optimized for large transfers, but it seems to do okay with small ones. Using shmem_broadcast64 to broadcast 2 integers from one core to all 16 cores takes about 250 clock cycles (see figure 6). If I understand your figure correctly, your linear broadcast takes around 500 clock cycles for 2 integers.
There is also a hardware broadcast feature that I never really explored because it's kind of broken on E3.