I'm absolutely convinced the Epiphany architecture would be perfect for this sort of thing. It has some similarities to the Movidius, but it could go further in future versions (the designed-but-never-produced 1024-core version..), although the existing board might not perform well compared to more recent devices.
The most important algorithm IMO would be convolutional neural networks. You'd have to figure out the best way to use the local stores (something like 'storing one filter per core'): e.g. you'd probably parallelise the evaluation of each 'feature map' in layer [n+1] from layer [n], streaming the 'current layer' in across all the cores.
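A rough sketch of what one e-core would compute under that scheme, if it held the filters for a single layer-[n+1] feature map in local store and accumulated over every layer-[n] map streamed past it. All names and the flat memory layout here are illustrative, not any real Epiphany/PAL API:

```c
/* One core's job: produce one output feature map by summing a k x k
 * 'valid' convolution over each of c input maps streamed in.
 * in_maps: c maps of w*h floats, contiguous; filters: c of k*k floats. */
void eval_feature_map(const float *in_maps, int c, int w, int h,
                      const float *filters, int k, float *out)
{
    int ow = w - k + 1, oh = h - k + 1;
    for (int i = 0; i < ow * oh; i++)
        out[i] = 0.0f;
    for (int m = 0; m < c; m++) {
        const float *in = in_maps + m * w * h;
        const float *f  = filters + m * k * k;
        for (int y = 0; y < oh; y++)
            for (int x = 0; x < ow; x++) {
                float acc = 0.0f;
                for (int fy = 0; fy < k; fy++)
                    for (int fx = 0; fx < k; fx++)
                        acc += in[(y + fy) * w + (x + fx)] * f[fy * k + fx];
                out[y * ow + x] += acc;   /* accumulate across input maps */
            }
    }
}
```

With one output map per core, all the cores run this same loop on the same streamed input, each with different filters, so the streaming bandwidth is amortised across the whole chip.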
Regarding the fact that the Movidius has one big shared on-chip memory: you could still partition off a fraction of each e-core's local store for that purpose (e.g. of the 32k, 24k used privately by the core, and 8k x number-of-cores pooled as a shared area used by all).. many ways it could work.
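The address arithmetic for that 24k/8k split might look something like this. The sizes and the 16-core count are just the example numbers from above (Epiphany-III-ish); none of this is the real Epiphany memory map:

```c
#include <stdint.h>

/* Hypothetical partitioning of each core's 32 KB local store: the low
 * 24 KB stays private, the top 8 KB joins a chip-wide "shared pool"
 * reached through each owning core's globally-addressable alias. */
#define LOCAL_MEM_SIZE  (32u * 1024u)
#define PRIVATE_SIZE    (24u * 1024u)
#define SHARED_SLICE    (LOCAL_MEM_SIZE - PRIVATE_SIZE)  /* 8 KB per core */
#define NUM_CORES       16u                              /* e.g. 4x4 chip */

/* Total bytes in the pooled region across the whole chip. */
uint32_t shared_pool_bytes(void) { return SHARED_SLICE * NUM_CORES; }

/* Which core's local store holds a given pool offset... */
uint32_t slice_owner(uint32_t pool_off) { return pool_off / SHARED_SLICE; }

/* ...and where it sits inside that core's 32 KB. */
uint32_t slice_local_off(uint32_t pool_off)
{
    return PRIVATE_SIZE + pool_off % SHARED_SLICE;
}
```

On 16 cores that gives a 128 KB software-managed "shared memory" at the cost of a divide/modulo (or shift/mask, since the slice size is a power of two) per access.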
Also consider that CNNs have been shown to work ok at 1-bit precision (https://news.ycombinator.com/item?id=11320896), which could help squeeze more on chip. (I hope the 1024-core version has a 'popcount' instruction, which would help with this; they list cryptography extensions, so it's likely..)
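The reason popcount matters: with weights and activations packed as sign bits, a dot product collapses to XNOR plus a population count. A minimal sketch, using GCC/Clang's `__builtin_popcount` as a stand-in for the hoped-for hardware instruction:

```c
#include <stdint.h>

/* Dot product of two nbits-long vectors of {-1,+1} values, each packed
 * as sign bits (bit set = +1). XNOR marks positions where signs agree;
 * popcount counts them; 2*pop - nbits maps back to the +/-1 sum. */
int binary_dot(uint32_t a, uint32_t b, int nbits)
{
    uint32_t agree = ~(a ^ b);              /* XNOR: 1 where bits match */
    if (nbits < 32)
        agree &= (1u << nbits) - 1u;        /* mask off unused high bits */
    return 2 * __builtin_popcount(agree) - nbits;
}
```

So a 32-wide multiply-accumulate becomes three cheap ops, and the weights shrink 32x, which is exactly the "squeeze more on chip" win.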
The 1024-core version has almost* 64 MB of on-chip memory - that's enough to hold the data for some vision nets entirely on chip.
With enough cores, you could pipeline the whole thing (for inference).. a group of cores *per layer*... the scope for parallelism is practically unlimited (layers x feature maps x width & height across the image).
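A toy software model of that per-layer pipeline: each stage stands in for a group of cores that owns one layer's weights, and once the pipeline fills, one finished frame comes out per step. The three stage functions are placeholders (invented for illustration), not real layers:

```c
#define N_STAGES 3
typedef float (*stage_fn)(float);

static float stage0(float x) { return x * 2.0f; }  /* conv stand-in */
static float stage1(float x) { return x + 1.0f; }  /* pool stand-in */
static float stage2(float x) { return x * x; }     /* fc stand-in   */

/* Push n frames through the pipeline; every stage works on a
 * different frame each step, modelling the per-layer core groups. */
void run_pipeline(const float *frames, int n, float *out)
{
    stage_fn stages[N_STAGES] = { stage0, stage1, stage2 };
    float reg[N_STAGES];                  /* frame held by each stage */
    int   tag[N_STAGES];                  /* its frame index, -1 = empty */
    for (int s = 0; s < N_STAGES; s++)
        tag[s] = -1;

    for (int step = 0; step < n + N_STAGES - 1; step++) {
        if (step < n) { reg[0] = frames[step]; tag[0] = step; }
        /* drain back-to-front so every stage advances one slot */
        for (int s = N_STAGES - 1; s >= 0; s--) {
            if (tag[s] < 0) continue;
            float r = stages[s](reg[s]);
            if (s == N_STAGES - 1) out[tag[s]] = r;
            else { reg[s + 1] = r; tag[s + 1] = tag[s]; }
            tag[s] = -1;
        }
    }
}
```

After the fill latency of N_STAGES steps, throughput is one frame per step regardless of network depth - that's the appeal for inference.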
I regret I never got around to actually doing anything here (I never even got a board).. if anyone did implement this it could grab a lot of attention and increase the chances that people get behind the 1024-core version, but there are rival architectures with different approaches ongoing (I note NVIDIA GPUs are getting 4x4 'tensor units' to accelerate one step... I still think the e-cores could do it better).
CNNs are a proven algorithm that seems to handle so many use cases; I think there have been experiments with self-driving cars driven purely by them, etc (e.g. image input, steering output).
(* minus dead cores, whatever % that will be)
Multi-layer convolution with a user-defined function applied to the result would be a great primitive in PAL (might need some #define abuse to make this manageable in C, or some templated library building on what jar has presented).
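One function-pointer flavour of that primitive might look like this. The name `p_conv_map_f32` is invented in PAL's naming style, not an actual PAL function, and a real version would want the #define/template treatment to inline the callback:

```c
/* Hypothetical PAL-style primitive: 2D 'valid' convolution with a
 * user-supplied elementwise function applied to each result
 * (e.g. an activation). post == 0 means no post-processing. */
typedef float (*p_map_fn)(float);

void p_conv_map_f32(const float *in, int w, int h,
                    const float *filt, int k,
                    p_map_fn post, float *out)
{
    int ow = w - k + 1, oh = h - k + 1;
    for (int y = 0; y < oh; y++)
        for (int x = 0; x < ow; x++) {
            float acc = 0.0f;
            for (int fy = 0; fy < k; fy++)
                for (int fx = 0; fx < k; fx++)
                    acc += in[(y + fy) * w + (x + fx)] * filt[fy * k + fx];
            out[y * ow + x] = post ? post(acc) : acc;
        }
}

/* Example user function: ReLU. */
static float relu(float x) { return x > 0.0f ? x : 0.0f; }
```

Fusing the activation into the convolution this way avoids a second pass over the output, which matters when the output has to fit in (or stream through) a 32k local store.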