I have not run the code, but it looks like your "tailEnds" (cores 0 and 15, presumably) do not wait for the DMA engine to complete before using the DMA again (in e_dma_copy).
Another thing developers often don't realize is that just because the DMA engine has completed (e_dma_wait has returned) doesn't mean your data is where you think it should be. DMA completion only means that the last of the commands to move data across the network or e-link interface has been issued. You must perform a separate, non-trivial check that the data has actually finished moving. If multiple cores are reading or writing to that location, then you must also introduce synchronization. In the OpenSHMEM API, the coherence checking is handled implicitly by shmem_quiet after the call to a non-blocking shmem_put*_nbi or shmem_get*_nbi operation. The e-lib library provides no mechanism for this; it is left to the developer.
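To make the "check that the data completed moving" idea concrete, here is a rough sketch of one common approach: poison the last word of the landing buffer before the transfer starts, then poll that word afterwards. This is not part of e-lib or OpenSHMEM. The names SENTINEL, arm_buffer and data_arrived are made up for illustration, and the sketch assumes that the transfer writes the buffer in order (last word last) and that the real payload never ends in the sentinel value. It is plain C so you can try the logic on a host first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical marker meaning "the payload has not landed yet".
 * Assumption: the real payload never ends with this value. */
#define SENTINEL 0xDEADBEEFu

/* Call on the landing buffer BEFORE the transfer is started:
 * poison the last word so its arrival can be detected later. */
static void arm_buffer(volatile uint32_t *buf, size_t nwords)
{
    assert(nwords > 0);
    buf[nwords - 1] = SENTINEL;
}

/* Returns 1 once the last word has been overwritten, 0 otherwise.
 * Assumption: words arrive in order, so the last word landing
 * implies the earlier words have landed too. */
static int data_arrived(const volatile uint32_t *buf, size_t nwords)
{
    return buf[nwords - 1] != SENTINEL;
}
```

On the device, the idea would be to call arm_buffer before starting the DMA, and after e_dma_wait returns, spin on data_arrived before touching the buffer. If more than one core touches the buffer you still need separate synchronization on top of this.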
Also, I have experienced some DMA weirdness that I was never able to pin down. I know that isn't helpful. In general, I avoid DMA with OpenSHMEM calls, since synchronous copies typically beat it on Epiphany-III. Asynchronous/non-blocking code is also more complicated. The painstakingly optimized shmemx_memcpy routine is the fastest way to write contiguous blocks of aligned memory on Epiphany-III (and it also handles misalignment).