You should check out merge sort.
I'd ship the data to the cores in chunks, heap sort each chunk while it's there (heap sort works in-place, but isn't a stable sort), and use spare cores or the CPU to merge the sorted lists. Use the main memory available to the CPU as backing storage, and spill the merged lists out to disk if that runs out.
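As a toy model of that pipeline (Python standing in for what would really be C on the Epiphany cores, with DMA moving the chunks; all names here are made up for illustration):

```python
import heapq

def heap_sort(chunk):
    # What one core would do with its chunk: in-place heap sort (not stable).
    heapq.heapify(chunk)                     # O(n) bottom-up build
    return [heapq.heappop(chunk) for _ in range(len(chunk))]

def chunked_sort(data, chunk_size):
    # Split the data into per-core chunks, sort each independently,
    # then merge the sorted lists back together on the "CPU" side.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    sorted_runs = [heap_sort(c) for c in chunks]
    return list(heapq.merge(*sorted_runs))
```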
edit: I didn't actually notice that you mentioned both heap and merge sort already. I'll just add that both add challenges and opportunities...
* an effective merge sort will depend on how you lay out the programs running on each core, but it will ultimately be limited by CPU<->Epiphany bandwidth
* the actual merge step takes very little time relative to the sorts, so it might be worth combining two heap sorts and a merge in each core
* heap sort has nice, fairly predictable behaviour for adding new elements and removing the minimum (or maximum); you can use its guaranteed worst-case bounds to help with scheduling
* it's also more efficient to add multiple data points to a heap in one go rather than adding them one at a time
* you could use multi-buffering to feed multiple heaps from the bottom (supplying one full level at a time) so that you're interleaving data transfer (DMA) with computation as much as possible
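On the bulk-add point: building a heap from n items in one go is O(n), versus O(n log n) for pushing them one at a time. Python's `heapq` shows the same asymmetry (illustrative only; on the Epiphany this would be a hand-rolled bottom-up heapify in C):

```python
import heapq
import random

data = [random.random() for _ in range(10_000)]

# Bulk: bottom-up heapify of all items at once -- O(n) total.
bulk = list(data)
heapq.heapify(bulk)

# One at a time: repeated sift-up -- O(log n) per push, O(n log n) total.
incremental = []
for x in data:
    heapq.heappush(incremental, x)

# Both end up as valid min-heaps with the same minimum at the root.
assert bulk[0] == incremental[0] == min(data)
```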
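And a sketch of the multi-buffering idea, with a bounded queue and a producer thread standing in for the DMA engine (all names hypothetical; the real version would use the Epiphany's DMA start/wait primitives, not threads):

```python
import threading
from queue import Queue

def feed_levels(levels, compute, depth=2):
    # The producer stages up to `depth` levels ahead (the "DMA"),
    # while the consumer computes on the level it already holds,
    # so data transfer and computation can overlap.
    q = Queue(maxsize=depth)

    def producer():
        for level in levels:
            q.put(level)        # stands in for "DMA transfer complete"
        q.put(None)             # sentinel: no more data

    threading.Thread(target=producer, daemon=True).start()

    results = []
    while True:
        level = q.get()
        if level is None:
            break
        results.append(compute(level))  # overlaps the next transfer
    return results
```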