I'm absolutely convinced the Epiphany architecture would be perfect for this sort of thing. It has some similarities to the Movidius, but it could go further in future versions (the designed-but-never-produced 1024-core version..), although the existing board might not perform well compared to more recent devices.
The most important algorithm IMO would be convolutional neural networks. You'd have to figure out the best way to use the local stores (something like 'storing one filter per core'): e.g. you'd probably parallelise the evaluation of each 'feature map' in layer [n+1] from layer [n], streaming the 'current layer' in across all the cores.
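A rough sketch of what one e-core would compute under that scheme, if it held the filters for a single layer-[n+1] feature map in local store and accumulated over every layer-[n] map streamed past it. All names and the flat memory layout here are illustrative, not any real Epiphany/PAL API:

```c
/* One core's job: produce one output feature map by summing a k x k
 * 'valid' convolution over each of c input maps streamed in.
 * in_maps: c maps of w*h floats, contiguous; filters: c of k*k floats. */
void eval_feature_map(const float *in_maps, int c, int w, int h,
                      const float *filters, int k, float *out)
{
    int ow = w - k + 1, oh = h - k + 1;
    for (int i = 0; i < ow * oh; i++)
        out[i] = 0.0f;
    for (int m = 0; m < c; m++) {
        const float *in = in_maps + m * w * h;
        const float *f  = filters + m * k * k;
        for (int y = 0; y < oh; y++)
            for (int x = 0; x < ow; x++) {
                float acc = 0.0f;
                for (int fy = 0; fy < k; fy++)
                    for (int fx = 0; fx < k; fx++)
                        acc += in[(y + fy) * w + (x + fx)] * f[fy * k + fx];
                out[y * ow + x] += acc;   /* accumulate across input maps */
            }
    }
}
```

With one output map per core, all the cores run this same loop on the same streamed input, each with different filters, so the streaming bandwidth is amortised across the whole chip.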
Regarding the fact that the Movidius has one big shared on-chip memory: you could still partition off a fraction of each e-core's local store for that purpose (e.g. of the 32k, 24k used privately by the core, and 8k x number-of-cores pooled as a shared area used by all).. many ways it could work.
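The address arithmetic for that 24k/8k split might look something like this. The sizes and the 16-core count are just the example numbers from above (Epiphany-III-ish); none of this is the real Epiphany memory map:

```c
#include <stdint.h>

/* Hypothetical partitioning of each core's 32 KB local store: the low
 * 24 KB stays private, the top 8 KB joins a chip-wide "shared pool"
 * reached through each owning core's globally-addressable alias. */
#define LOCAL_MEM_SIZE  (32u * 1024u)
#define PRIVATE_SIZE    (24u * 1024u)
#define SHARED_SLICE    (LOCAL_MEM_SIZE - PRIVATE_SIZE)  /* 8 KB per core */
#define NUM_CORES       16u                              /* e.g. 4x4 chip */

/* Total bytes in the pooled region across the whole chip. */
uint32_t shared_pool_bytes(void) { return SHARED_SLICE * NUM_CORES; }

/* Which core's local store holds a given pool offset... */
uint32_t slice_owner(uint32_t pool_off) { return pool_off / SHARED_SLICE; }

/* ...and where it sits inside that core's 32 KB. */
uint32_t slice_local_off(uint32_t pool_off)
{
    return PRIVATE_SIZE + pool_off % SHARED_SLICE;
}
```

On 16 cores that gives a 128 KB software-managed "shared memory" at the cost of a divide/modulo (or shift/mask, since the slice size is a power of two) per access.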
Also consider that CNNs have been shown to work ok at 1-bit precision (https://news.ycombinator.com/item?id=11320896), which could help squeeze more on chip. (I hope the 1024-core version has a 'popcount' instruction, which would help with this; they list cryptography extensions, so it's likely..)
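The reason popcount matters: with weights and activations packed as sign bits, a dot product collapses to XNOR plus a population count. A minimal sketch, using GCC/Clang's `__builtin_popcount` as a stand-in for the hoped-for hardware instruction:

```c
#include <stdint.h>

/* Dot product of two nbits-long vectors of {-1,+1} values, each packed
 * as sign bits (bit set = +1). XNOR marks positions where signs agree;
 * popcount counts them; 2*pop - nbits maps back to the +/-1 sum. */
int binary_dot(uint32_t a, uint32_t b, int nbits)
{
    uint32_t agree = ~(a ^ b);              /* XNOR: 1 where bits match */
    if (nbits < 32)
        agree &= (1u << nbits) - 1u;        /* mask off unused high bits */
    return 2 * __builtin_popcount(agree) - nbits;
}
```

So a 32-wide multiply-accumulate becomes three cheap ops, and the weights shrink 32x, which is exactly the "squeeze more on chip" win.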
The 1024-core version has almost* 64 MB of on-chip memory - that's enough to hold the data for some vision nets entirely on chip.
With enough cores, you could pipeline the whole thing (for inference).. a group of cores *per layer*... the scope for parallelism is practically unlimited (layers x feature maps x width & height across the image).
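A toy software model of that per-layer pipeline: each stage stands in for a group of cores that owns one layer's weights, and once the pipeline fills, one finished frame comes out per step. The three stage functions are placeholders (invented for illustration), not real layers:

```c
#define N_STAGES 3
typedef float (*stage_fn)(float);

static float stage0(float x) { return x * 2.0f; }  /* conv stand-in */
static float stage1(float x) { return x + 1.0f; }  /* pool stand-in */
static float stage2(float x) { return x * x; }     /* fc stand-in   */

/* Push n frames through the pipeline; every stage works on a
 * different frame each step, modelling the per-layer core groups. */
void run_pipeline(const float *frames, int n, float *out)
{
    stage_fn stages[N_STAGES] = { stage0, stage1, stage2 };
    float reg[N_STAGES];                  /* frame held by each stage */
    int   tag[N_STAGES];                  /* its frame index, -1 = empty */
    for (int s = 0; s < N_STAGES; s++)
        tag[s] = -1;

    for (int step = 0; step < n + N_STAGES - 1; step++) {
        if (step < n) { reg[0] = frames[step]; tag[0] = step; }
        /* drain back-to-front so every stage advances one slot */
        for (int s = N_STAGES - 1; s >= 0; s--) {
            if (tag[s] < 0) continue;
            float r = stages[s](reg[s]);
            if (s == N_STAGES - 1) out[tag[s]] = r;
            else { reg[s + 1] = r; tag[s + 1] = tag[s]; }
            tag[s] = -1;
        }
    }
}
```

After the fill latency of N_STAGES steps, throughput is one frame per step regardless of network depth - that's the appeal for inference.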
I regret I never got around to actually doing anything here (I never even got a board).. if anyone did implement this it could grab a lot of attention and increase the chances that people get behind the 1024-core version, but there are rival architectures with different approaches ongoing (I note NVIDIA GPUs are getting 4x4 'tensor units' to accelerate one step... I still think the e-cores could do it better).
CNNs are a proven algorithm that seems to handle so many use cases; I think there have been experiments with self-driving cars driven purely by them, etc (e.g. image input, steering output).
(* minus dead cores, whatever % that will be)
Multi-layer convolution with a user-defined function applied to the result would be a great primitive in PAL (might need some #define abuse to make this manageable in C, or some templated library building on what jar has presented).
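One function-pointer flavour of that primitive might look like this. The name `p_conv_map_f32` is invented in PAL's naming style, not an actual PAL function, and a real version would want the #define/template treatment to inline the callback:

```c
/* Hypothetical PAL-style primitive: 2D 'valid' convolution with a
 * user-supplied elementwise function applied to each result
 * (e.g. an activation). post == 0 means no post-processing. */
typedef float (*p_map_fn)(float);

void p_conv_map_f32(const float *in, int w, int h,
                    const float *filt, int k,
                    p_map_fn post, float *out)
{
    int ow = w - k + 1, oh = h - k + 1;
    for (int y = 0; y < oh; y++)
        for (int x = 0; x < ow; x++) {
            float acc = 0.0f;
            for (int fy = 0; fy < k; fy++)
                for (int fx = 0; fx < k; fx++)
                    acc += in[(y + fy) * w + (x + fx)] * filt[fy * k + fx];
            out[y * ow + x] = post ? post(acc) : acc;
        }
}

/* Example user function: ReLU. */
static float relu(float x) { return x > 0.0f ? x : 0.0f; }
```

Fusing the activation into the convolution this way avoids a second pass over the output, which matters when the output has to fit in (or stream through) a 32k local store.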