Parallella Community

by **mhonman** » Wed Nov 20, 2013 11:59 am

I've just come across a paper from Paralant describing their Insight library for Epiphany. It contains a super description of how to get the best from the Epiphany architecture, and includes measured speedups for various matrix sizes on a 16-core Epiphany at 1 GHz per core.

The bottom line is that it tops out at 7 GFLOPS for multiplication of 64x64 matrices - not at all shabby! [edit: as pointed out by EggBaconAndSpam below, that is on just 4 cores]

This paper is very well written and could be used as a tutorial on getting the best from the Epiphany architecture.

by **shodruk** » Wed Nov 20, 2013 1:20 pm

Clearly, the bottleneck is in external memory access.

by **EggBaconAndSpam** » Wed Nov 20, 2013 2:33 pm

I beg to differ.
First of all, they are actually using 4 cores, not 16, so 7 GFLOP/s at 1 GHz corresponds to roughly 90% efficiency.
Secondly, for sufficiently large matrices, the operation is bottlenecked by the computation (O(n^3)) rather than by the bandwidth/network.
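
The two points above can be checked with some quick arithmetic. This is a minimal sketch, assuming each Epiphany core can retire one fused multiply-add (2 floating-point operations) per cycle; that figure and all names in the code are assumptions for illustration, not numbers quoted in the thread.

```python
# Assumption: one fused multiply-add per core per cycle = 2 flops per cycle.
FLOPS_PER_CYCLE = 2
CLOCK_GHZ = 1.0         # clock rate quoted in the thread
CORES_USED = 4          # cores actually used, per the post above
MEASURED_GFLOPS = 7.0   # measured rate quoted in the thread

peak_gflops = CORES_USED * CLOCK_GHZ * FLOPS_PER_CYCLE
efficiency = MEASURED_GFLOPS / peak_gflops


def flops_per_element(n):
    """Floating-point operations per matrix element moved, for an n x n multiply."""
    flops = 2 * n**3      # n^3 multiplies plus n^3 adds
    elements = 3 * n**2   # read A and B, write C once each
    return flops / elements


print(f"peak = {peak_gflops} GFLOPS, efficiency = {efficiency:.1%}")
for n in (8, 64, 512):
    print(f"n = {n}: {flops_per_element(n):.1f} flops per element moved")
```

Under that assumption the peak for 4 cores is 8 GFLOPS, so 7 GFLOPS is 87.5%, which is the "roughly 90%" above. The flops-per-element ratio grows linearly with n, which is why compute eventually dominates data movement for large enough matrices.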

by **shodruk** » Thu Nov 21, 2013 9:36 am

Excuse me. I should have written "off-chip memory access".
I'm wondering why they had to restrict themselves to matrices of up to 64x64.
However, 90% efficiency is great.

by **mhonman** » Thu Nov 21, 2013 11:57 am

What I felt to be particularly cunning in their approach is to take advantage of the large register file, breaking the big matrix into smaller ones that can be multiplied in registers only (compared to the Adapteva example where the smallest matrix is sized for internal RAM).
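
The register-blocking idea described above can be sketched as follows. This is only an illustration of the general technique in plain Python, not Paralant's or Adapteva's code: the function names and the 4x4 tile size are hypothetical, and on real hardware the `acc` tile would live in registers rather than in a list.

```python
def matmul_naive(a, b, n):
    """Reference n x n multiply on lists of lists."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += a[i][k] * b[k][j]
            c[i][j] = s
    return c


def matmul_blocked(a, b, n, tile=4):
    """Multiply n x n matrices one tile of C at a time (n must be a multiple of tile)."""
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            # Accumulators for one tile of C: the part meant to stay in registers.
            acc = [[0.0] * tile for _ in range(tile)]
            for k in range(n):
                # Load a short column slice of A and row slice of B once,
                # then reuse each loaded value `tile` times.
                a_col = [a[i0 + i][k] for i in range(tile)]
                b_row = [b[k][j0 + j] for j in range(tile)]
                for i in range(tile):
                    for j in range(tile):
                        acc[i][j] += a_col[i] * b_row[j]
            # Write the finished tile back to memory exactly once.
            for i in range(tile):
                for j in range(tile):
                    c[i0 + i][j0 + j] = acc[i][j]
    return c


n = 8
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i * j % 5) for j in range(n)] for i in range(n)]
print(matmul_blocked(a, b, n) == matmul_naive(a, b, n))
```

A 4x4 tile needs 16 accumulators plus 8 loaded operands, which is why a large register file matters: the bigger the tile that fits, the fewer loads are needed per multiply-add.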

I think though that there is no element of staging matrix chunks between internal and external RAM, as 3 64x64 matrices will fit into the internal RAM of 4 cores. It would be interesting to know whether the Insight library product runs up to the 512x512 arrays that the Adapteva demo can handle, and how performance scales once the matrices cannot fit into Epiphany internal RAM.
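
The sizing argument above is easy to check. This is a rough sketch, assuming single-precision (4-byte) elements and 32 KB of local RAM per core; both figures are assumptions not stated in the thread, and the helper name is hypothetical.

```python
BYTES_PER_FLOAT = 4               # assumption: single-precision elements
LOCAL_RAM_PER_CORE = 32 * 1024    # assumption: 32 KB of local RAM per core


def fits_on_chip(n, cores):
    """Return (bytes needed, bytes available, fits?) for A, B and C of size n x n."""
    needed = 3 * n * n * BYTES_PER_FLOAT
    available = cores * LOCAL_RAM_PER_CORE
    return needed, available, needed <= available


for n, cores in ((64, 4), (512, 16)):
    needed, available, ok = fits_on_chip(n, cores)
    print(f"{n}x{n} on {cores} cores: need {needed // 1024} KB, "
          f"have {available // 1024} KB, fits = {ok}")
```

Under these assumptions three 64x64 matrices take 48 KB against 128 KB on 4 cores, while three 512x512 matrices take 3 MB against 512 KB on all 16 cores. The local RAM also has to hold code and stack, so the real budget is smaller than the raw figure.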

by **shodruk** » Fri Nov 22, 2013 11:39 am

I suspect their library couldn't handle 512x512 matrices at that time.
Because Parallella's off-chip memory bandwidth is rather low compared with its computational horsepower, I think the key optimization for Parallella is to minimize off-chip memory access.
And that should be very hard... (but fun!)
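
To put rough numbers on why minimizing off-chip access matters, here is a back-of-envelope model. It assumes a simple blocked scheme in which each b x b block of C stays on chip while the matching strips of A and B are streamed in from off-chip memory; the model, the function names and the 4-byte element size are all assumptions for illustration, not a description of any existing library.

```python
def offchip_bytes(n, b, bytes_per_elem=4):
    """Off-chip traffic for an n x n multiply with b x b blocks of C held on chip."""
    blocks = (n // b) ** 2
    reads = blocks * 2 * n * b   # per block: an n x b strip of A and of B
    writes = n * n               # each element of C is written back once
    return (reads + writes) * bytes_per_elem


def required_bandwidth_gbs(n, b, target_gflops):
    """Off-chip bandwidth (GB/s) needed to sustain target_gflops under this model."""
    flops = 2 * n**3
    seconds = flops / (target_gflops * 1e9)
    return offchip_bytes(n, b) / seconds / 1e9


for b in (16, 32, 64):
    print(f"block {b}: {offchip_bytes(512, b) / 1e6:.1f} MB moved, "
          f"{required_bandwidth_gbs(512, b, 7.0):.2f} GB/s needed at 7 GFLOPS")
```

In this model, halving the block size roughly doubles the off-chip traffic, so the largest block that fits in on-chip memory sets how much bandwidth a given compute rate demands.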

by **mhonman** » Fri Nov 22, 2013 12:31 pm

by **shodruk** » Sat Nov 23, 2013 12:03 pm

by **sriranjan** » Wed Feb 03, 2016 7:41 am

Could someone point me to the exact code (library and example) for matrix multiplication on the 16-core chip, like Yaniv showed in the video?

by **ubii** » Wed Feb 03, 2016 9:29 pm

How about this? - https://github.com/parallella/parallell ... sembly-opt

Matrix multiplication on PARALLELLA IV