I've used the following library in the past; it was quick, easy to use, and easy to port. That was a while ago, but at the time it was certainly very fast on NEON:
https://github.com/anthonix/ffts
edit: It looks like it's not very good once the data outgrows the cache, so it may need some work. In-cache, though, it's quick.
edit2: Oops, I was timing the verification code. No, it's fast.
It has a code generator, so it could be modified to emit Epiphany code directly, although code size might be an issue (and the modification isn't trivial). This sprang out of another project, which might be equally applicable to Epiphany though:
https://github.com/anthonix/ffts-fpga
It's an FPGA implementation.
This next library (or rather suite) may have some useful building blocks. Since the Epiphany isn't SIMD, some of the more exotic libraries shouldn't be necessary for good performance.
http://www.kurims.kyoto-u.ac.jp/~ooura/fft.html
I've done some poking around from fundamentals, but nothing I've come up with is very fast so far. I also hit some hardware/firmware issues, which killed the mood a bit.
FWIW: although the C compiler does a pretty good job of the arithmetic and scheduling, it's not very good at the memory accesses and indexing. So I've been playing with some assembly as well, although I haven't got to the point of timing anything to see if it really matters.
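
To illustrate the kind of indexing difference I mean, here is a rough sketch in C. It is not code from any of the libraries above; the function names pass_indexed and pass_pointers are made up for this post, and the twiddle factors are left out (treated as 1) so only the addressing differs. Both functions do one radix-2 butterfly pass over interleaved complex floats: the first recomputes each array index, the second walks two pointers. Whether the second form actually compiles to better code on the Epiphany is exactly what I haven't measured.

```c
#include <stddef.h>

/* One radix-2 butterfly pass over n interleaved complex floats
 * (re, im, re, im, ...), pairing element k with element k + n/2.
 * ASSUMPTION: twiddle factors are omitted (all 1) to keep the
 * focus on addressing; a real FFT pass would multiply by them. */

/* Version 1: every access recomputes an index from k. */
void pass_indexed(float *data, size_t n)
{
    size_t half = n / 2;
    for (size_t k = 0; k < half; k++) {
        float ar = data[2 * k];
        float ai = data[2 * k + 1];
        float br = data[2 * (k + half)];
        float bi = data[2 * (k + half) + 1];
        data[2 * k]              = ar + br;
        data[2 * k + 1]          = ai + bi;
        data[2 * (k + half)]     = ar - br;
        data[2 * (k + half) + 1] = ai - bi;
    }
}

/* Version 2: same arithmetic, but two pointers are stepped
 * through the lower and upper halves with fixed small offsets. */
void pass_pointers(float *data, size_t n)
{
    size_t half = n / 2;
    float *a = data;            /* lower half */
    float *b = data + 2 * half; /* upper half */
    float *end = b;             /* lower half ends where upper begins */
    while (a != end) {
        float ar = a[0], ai = a[1];
        float br = b[0], bi = b[1];
        a[0] = ar + br;
        a[1] = ai + bi;
        b[0] = ar - br;
        b[1] = ai - bi;
        a += 2;
        b += 2;
    }
}
```

Both versions produce identical results for even n, so any difference is purely in the generated load/store and address arithmetic, which is easy to check by comparing the compiler's assembly output for the two.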