I made that image quite a while ago, and I think you're correct to point out that the device has 16 compute units, each with one processing element (the image is incorrect). It was generated before I understood the specifics of the COPRTHR OpenCL implementation.
You can use OpenCL to program the Epiphany cores. You should create 16 work groups with 1 thread per work group. Creating more than 16 work groups or more than one thread per work group will oversubscribe the hardware and result in worse performance.
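As a rough illustration of that sizing rule, here is a small plain-C sketch that does not call OpenCL at all. The helper names (num_work_groups, fits_epiphany) and the EPIPHANY_CORES constant are made up for this example; the only real API mentioned is clEnqueueNDRangeKernel, whose global_work_size and local_work_size arguments are where these two numbers would normally go.

```c
#include <stddef.h>

/* Assumption for this sketch: 16 Epiphany cores, as discussed in this thread. */
#define EPIPHANY_CORES 16

/* Hypothetical helper: number of work groups implied by a 1-D NDRange.
 * Returns 0 when the sizes are invalid (local size is zero, or the
 * global size is not a multiple of the local size). */
static size_t num_work_groups(size_t global_size, size_t local_size)
{
    if (local_size == 0 || global_size % local_size != 0)
        return 0;
    return global_size / local_size;
}

/* Hypothetical helper: returns 1 when the launch uses at most one work
 * group per core and exactly one work item per work group, 0 otherwise. */
static int fits_epiphany(size_t global_size, size_t local_size)
{
    size_t groups = num_work_groups(global_size, local_size);
    return groups > 0 && groups <= EPIPHANY_CORES && local_size == 1;
}

/* In a real host program the two sizes would be passed to
 * clEnqueueNDRangeKernel as global_work_size and local_work_size,
 * for example global_work_size = 16 and local_work_size = 1. */
```

With a global size of 16 and a local size of 1, num_work_groups gives 16 and fits_epiphany gives 1. A global size of 32 (too many work groups) or a local size of 2 (more than one thread per work group) both make fits_epiphany return 0.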
OpenCL, as the standard exists, lacks a mechanism for inter-core communication, which is just one of the many failures of the Apple/Khronos API design. The model was designed with GPUs from 2008 in mind because CUDA was winning. The OpenCL C language specifies memory locality, but not accessibility; accessibility is implicitly determined by locality within the standard. Thus, the standard fails to account for architectures like the Epiphany. Either you break the OpenCL standard by introducing non-standard communication mechanisms, resulting in non-portable code, or you keep to the standard and accept the poor performance of synchronizing through global memory (a weak point of the current Parallella/Epiphany design).
If you're writing OpenCL applications for the Epiphany cores, you will probably be reading and writing global memory rather than reading and writing the local memory of neighboring cores. OpenCL private memory and OpenCL local (shared) memory are the same thing on the Epiphany. But since you have a work group size of 1, shared memory is a silly concept (shared with one core).
The OpenCL concept of constant memory does not exist in hardware on the Epiphany cores, but each core has 32KB of core-local memory, so small constant data structures can be replicated in every core.
Hope that helps you. Sorry for the misunderstanding with the old figure.