Parallella Community

Posted: **Mon Feb 01, 2016 7:44 pm**

Hello,

I have been using the hpl code to run the linpack from here: http://www.netlib.org/benchmark/hpl/
I am using the distributed version, and it is very good. The situation is this:

I compile a custom BLAS, have my own MPI library, and I link both of them to the hpl code. The hpl code runs perfectly with my custom BLAS-MPI, in a distributed cluster. The only problem is that it is made for double-precision (and my Epiphany-BLAS in double precision is very slow...).

Does anybody know any similar implementation for single-precision?
The basic feature I need is: "ability to compile with custom BLAS"
A secondary, but desired one: "ability to run on distributed hardware/ support for MPI"

P.S.: Even in single-precision the BLAS still needs a lot of improvement, but I am in about 1.4GFLOPS (per parallella node in A[128x7808] x B[7808x128] single precision), with data transfer,pre and post processing included; I think I can improve 3 or 4 times that with changes in the outer algorithm and adding some assembler in the inner function also; by now it is all C code and I have clearly identified the function to optimize (luckily it is the one that makes the scalar multiplications...); I also think I could get close to the theoretical peak with some improvement in the e-link FPGA speeds.
The "outer" algorithm used is not Cannon's. The current performance is not better than ones seen before by now, but I think the algorithm will let further improvements (and that is the advantage over Cannon's), for this platform (working on it).

Posted: **Mon Feb 01, 2016 10:50 pm**

I probably won't be able to help, but which BLAS routines do you need? HPL doesn't use all of them and you said you had some already.

Since it seems like you're using 128x128 matrices on-chip, you'll be fundamentally limited by the off-chip bandwidth unless you find a method for fully overlapping communication with computation
Excluding the write to C, you have to read the A and B matrices (2*128*128*4 bytes = 128 KB) with 2*128^3 floating point operations for an arithmetic intensity of 32 FLOPS/byte. With an off-chip read bandwidth of ~90 MB/s, you're limited to 32*90 = 2.88 GFLOPS if your matrix-multiply was peak performance. That 90 MB/s is the old e-link bandwidth. I recall Andreas tweeted an improved e-link to something like 160 MB/s. But that would still only get you up to ~5.1 GFLOPS.

Posted: **Tue Feb 02, 2016 4:23 am**

Hello jar, thanks for the answer. I'll try to explain.

OK, by parts (I won't comment everything here, it's too long, that will be when [and if] I have actual better results):

1) I am already (almost) fully overlapping communication with computation (ARM handles input communication by now, while Epiphany computes [and it is done in chunks of 128x128x2 on the input]; output communication is irrelevant in this implementation)

2) I am using e_write, by now, with an experimental bandwidth of about 45MB/s (Which probably gives the 1.4GFLOPS limit, I've done the math before but won't check it now; it is your 2.88GFLOPS/2 [for 90MB/s] in any case, so it sounds good). I know I can use ERAM, and possibly double it, but I am working on the other parts first.

3) I know the limits of multiplying 128x128 matrices. That multiplication is done "on-chip" and, as you say, can't achieve much better performance (it can improve with the bandwidth but not by other means). It is possible to improve to 180x180 matrices on chip (I've done it) but that's not the interesting point (little improvement). It is also possible to accumulate the results, as you move on " the k dimension", to reduce the weight of the "post-processing" (I am doing that, and post-processing goes down to less than 2%).

4) Now the interesting part: I am using an algorithm that "extends nicely" to matrices that are bigger than the storage of the chip. For example, I have one implementation of "pure SUMMA" algorithm, for A(512x512) B(512x512), in which the input transfer time is totally irrelevant. As a secondary effect, the output grows a lot (and that is O(N^2) so you normally don't want it...). But that is with "pure SUMMA + ARM postprocessing". The interesting thing is:
"pure SUMMA" = little input/ big output (a lot of ARM post-processing, and Epiphany->ARM comm)
"pure classical" = big input/little output (a lot of ARM->Epiphany comm)
What I am doing is mixing both, hoping to find a good equilibrium that, if lucky enough, will reach close to the computation speed bounds (hopefully with some help on the bandwidth side... which seems to be possible thanks to Andreas; but the point is "I think it is possible to be more efficient on the outer algorithm side also").
I expect to have an answer to this "luck" issue this week or the next (I hope to finish the "outer part of my matmul", still missing the inner function optimization, but being able to predict the "data transfer" limitations already. For the "inner function" I may borrow the "matmul_optimized" assembly code, that comes with the parallella examples [the authors are from the Australian National University]).

5) Postprocessing is important for me, as I am not making "matmul" but a complete "sgemm" kernel to be used in BLAS. That means taking into account alpha, beta, cs_a, rs_a, cs_b, rs_b, cs_c, rs_c,... etc... (I've been learning about those on the run, hehe, but now I seem to handle them ok).

Finally: If I get a good kernel I already have the needed "surrounding technology" to build a "kernel-powered BLAS" within an afternoon, and I have a non-standard test that I can run on it. But I want Linpack! (imagine the voice of a crying baby saying that

), and the (great) hpl version of Linpack I have is for doubles.... that's the situation by now... (also have plans to improve the "double" version with "extended precision" as (I think it was you, in fact) suggested, but that will probably take more time [unless there is a volunteer... it's totally parallel work: optimizing an already existing ASM function to work with extended precision]. I still haven't even tested the ASM compiler even once, so I would have at least some time to get used to it, installing the tools, etc).

Posted: **Thu Feb 04, 2016 2:30 pm**

Update:
Still not a complete implementation, but my estimate is that my approach can cut the bandwidth restriction by about a half (I get 2,45GFLOPS limit with 45MB/s, so 5,1GFLOPS could become about 10GFLOPS). That's working with N=256, which (packing the things on the internal banks) can be used without "intermediate ARM processing". I can use N=512, and get (x2) GFLOPS limit, and so on, but that requires summing intermediate results in the ARM, and that costs a lot.

That would be all true for Matmul A(256x256) x B(256x256).

Now there is an additional problem with sgemm. There you have to do C_out = alpha*A*B + beta* C_in
The great problem is that "+beta*C_in". There is a postprocessing cost of summing 256x256 floats with the ARM (of course it would be terribly worse to do so by sending to the Epiphany).
The weight of that cost can be reduced if I can use a bigger K (for example stream-accumulating in the Epiphany as in my 128x7808 implementation), but I can't accumulate and at the same time take advantage of the 2x input boost.

Resuming:
- in Matmul, input restrictions can be boosted x2 (without "ARM overhead"), and possibly more with a bit of "ARM overhead" (I have yet another trick for the "intermediate ARM processing", but didn't work with N=512 because I overflowed the ERAM [the idea is, of course, to store all the intermediate results without accumulating and summing all at once, in the end]). It works with N=256, and I have some hopes for N=384.
- sgemm has an additional problem.

It could all improve if there exists a superb 2-core summing algorithm already implemented for ARM.
My second bet would be a parallel SUM (DMA style) algorithm in the FPGA but that is a long shot (may probably suffer from the same I/O problems as the Epiphany... I suppose it has faster access to CPU, but the same problems when accessing RAM, not sure...)

UpdateUpdate: Maybe the ARM limit is not so bad. Initially I got 3.5GFLOPS limit. I compiled with -O3 (in the host side) and got a limit of 7.9GFLOPS. Perhaps some loop unrolling could be enough... I doubt a multithreaded version could do much better, because of the DRAM access... but who knows.
Update^3: 32 loops and -Ofast can get it to 9GFLOPS. I think I am close enough to 10 GFLOPS think this won't be a limitation. Now I need to complete all the missing things, to get the real implementation (ASM improved computation... without eating all internal RAM [I can do that, I am sure, but still not done], and I/O updates with ERAM and later new FPGA e-link).

Posted: **Thu Feb 04, 2016 5:23 pm**

How are you doing your timing?

Posted: **Thu Feb 04, 2016 6:04 pm**

clock() function in the host-side, from standard time.h library.

I can still do it from the host side as I am using e_write to send the input data. Postprocessing is in the ARM.

The total Epiphany time is measured by the ARM, from "ARM puts shared byte up" until "Epiphany puts shared byte down".

My method for measuring time inside the Epiphany, by now, is a bit rough... but I did not report none of those times

(what I do, as I don't want to play with timers yet, is this: I just comment the code block that I want to measure time, and uncomment. Then substract

(no conditional branching in commented code, and precision is relatively good it seems [it adds up correctly to the total], but I wouldn't call that "measurements"...).)

To give an idea, my debugging method consists of changing code inside the kernel and watching the results in the ARM ssh screen... (I have a "one message per-run buffer Epi-ARM 'printf' ["Hello world style"]" [can write more than one per run but would not be deterministic] that I used in earlier stages of debugging, but now I handle with outputs...). The best programming practise for my "custom IDE" (program in PC, rsync script, compile, load, run and hope for the best) consists in not making mistakes

NOTE: I also compiled an entire BLAS with some of the kernels and used external tools to measure performance. It was in agreement with mine (for earlier versions; this one still needs work). The total sgemm, of course, has less performance than the kernel.

NOTE2: In fact I have already run some linpack tests, with some of the kernel powered BLAS. But to do so I had to change every "float" with "double" (make size in internal RAM for them also...), and performance is so bad that it is not worth reporting... (ARM does better for doubles, by now; at least for those early implementations I tested... maybe now it could improve, but don't think too much)

Posted: **Fri Feb 05, 2016 9:02 pm**

It is worth noting that in the last emotional "UpdateUpdate" there was a mistake in the postprocessing times. Those didn't take into account the time that takes the ARM to read from ERAM. So e_read(&memory) is limiting to 2,4GFLOPS, by now. :oops:

Nothing to do with SEND or RECEIVE, to/from Epiphany. Simply ARM host and it's RAM...

If I could access ERAM with a normal C pointer, from the host (would take care not to do it at the same time as Epiphany), I suppose that would stop being a problem... (will investigate that)

I can get about a limit of 12-13GFLOPS, for postprocessing limits, if the pointers are normal C pointers...
The boost in the input is real (I get the 2,8GFLOPS limit with 45MB/s, and that one grows linearly with I/O speed), but of course is worth nothing if the postprocessing times limit the flow... (but I believe that the ARM should be able to read it's RAM "C style" faster... hope so)

P.S.: I posted too fast last time... may need to switch verbose mode off :oops:

I hope I will be able to write a real paper and send some code later (without "real time updates" mistakes)

Posted: **Fri Feb 05, 2016 9:39 pm**

Yes, that sounds about right. We all make mistakes.

Counting memory transfer time, you'll be limited to < 3 GFLOPS. If you don't count the off-chip transfer, it is possible to get closer to peak performance. Because SGEMM scales O(N^3) while communication scales O(N^2), it would be possible to achieve better performance on a hypothetical chip that had more memory per core, more cores, or both. But for now, you're coming up against some hard limits of the current implementation for offloading computation. More eLink bandwidth would also help.

Posted: **Tue Feb 16, 2016 8:36 pm**

If I don't take into account off-chip transfers, I am getting about 12GFLOPS (I know better results have been achieved... in fact I "stole" some ASM code to get that...).
I modified A. Varghese's ASM code, to fit my implementation (changed some strides, made it calculate for smaller k, etc.). The interesting thing is that after modifications, I lost some performance, but the code is much smaller (I didn't need so many "macro" copies...).

I can fit all my source code, all my "shared control" variables, and the stack in BANK 1 (I know, code and vars in the same bank is bad thing, but the performance penalty, compared to the gains, is unnoticeable, by now). That way I could accomodate, with effort, 256x256 matrices (need some tedious buffering methods, but can be done). The idea is to store one entire input, and roll the other. That way you never repeat "input sends", and you save half the space (aprox).

By now, I have 2 "matmul architectures", very different.

One is an "accumulator", that is always limited by the input bandwidth, and can only multiply matrices as long as the results fit in the internal memory banks (k can go to infinity on this implementation, but N is limited by N^2 < [2 BANKS SIZE] if I work a bit more, and everything goes well). Not too interesting. At the moment I have the limit N^2 < [1/2 BANK], and my best performance, after using DMA (but not yet the new e-link) is 2.25GFLOPS, in total (without off-chip, this one goes to about 9GFLOPS... ASM with smaller k...).

The other is the one I thought could really do something new. It gets "bigger than reasonable" Ns, and sends a lot of outputs. It has an N-Kinternal compromise. If the output + postprocessing were no concern this one would be very fast. I thought that would be the case, as writing to ERAM is faster than reading. But there is the ARM <- ERAM issue... Anyway, with Kinternal=256, N=128, output and postprocessing limit is about 4GFLOPS, while input remains in 2,3GFLOPS. Again, it is possible to do some RAM saving magic, and use N=256 with Kinternal =256 (don't know if will be worth the time). That should double the input performance leaving the output unchanged. This implementation is always limited by the size of N*Kinternal (the original idea was, "let's rise N and lower Kinternal and then sum the partial results in the host").

So, by now 2.25 GFLOPS. One idea for each implementation to double it (hopefully). Still without e-link improvement.

Posted: **Tue Feb 23, 2016 2:24 pm**

Just got 3.4 GFLOPS/s (real, total parallella board).

This implementation could potentially grow up to 4 GFLOPS (will not work on it past today). Still didn't change fpga e-link.

It's A(192x3072) , B(256x3072) , C(192x256)

Parallella Community

Single precision Linpack benchmark code

Single precision Linpack benchmark code

Re: Single precision Linpack benchmark code

Re: Single precision Linpack benchmark code

Re: Single precision Linpack benchmark code

Re: Single precision Linpack benchmark code

Re: Single precision Linpack benchmark code

Re: Single precision Linpack benchmark code

Re: Single precision Linpack benchmark code

Re: Single precision Linpack benchmark code

Re: Single precision Linpack benchmark code