Parallella Community

by **peteasa** » Sat Aug 20, 2016 8:31 am

Hi Miguel,

I read your paper with interest. I assume that you are not yet using the mailbox to signal (interrupt) the end of the calculations to the ARM chip. If you had an efficient way of signalling the end, there would be less need for the accumulator, and you might then get better performance? I also did not notice much about DMA in the paper. Do you think that using DMA between the ARM and the Epiphany would also give better performance, or are you already using DMA?

Peter.

by **MiguelTasende** » Mon Aug 22, 2016 1:33 pm

Hi Peter,

Thank you for your interest in the work, and for your comments.

No, I did not use any interrupts from the Epiphany to the host. I am polling a register, in the host code, to detect the end of the calculations. If it were possible "not to accumulate", I think the performance could grow a lot, but I think the limitation is not in how the Epiphany signals the end of calculations. The problem is that after the Epiphany signals the end of calculations, the host has to read the resulting data from the RAM. It happens (I still don't have a really good, complete explanation) that reading the "shared zone" of the RAM is very slow for the host. Reading other areas of the RAM is fast, but the shared area is slow. That seems to be the limiting factor that prevents abandoning the accumulator. In any case, I like the idea of using interrupts to signal the host, and it could improve the software in other, more creative ways, but the simple limitation that I see now is not that one.
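To illustrate the polling pattern described above, here is a minimal sketch. It is not the actual host code: `poll_until_done`, `read_flag_fn`, `fake_device`, and `fake_read` are hypothetical names, and the fake device only stands in for a real register or shared-memory read so the example runs on any machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: stands in for one read of the device's
 * "done" register (on real hardware this would go through the
 * host library instead). */
typedef uint32_t (*read_flag_fn)(void *ctx);

/* Poll the flag until it equals done_value or max_polls reads were made.
 * Returns the number of reads performed, or -1 on timeout. */
int poll_until_done(read_flag_fn read_flag, void *ctx,
                    uint32_t done_value, int max_polls)
{
    assert(read_flag != NULL);
    for (int i = 0; i < max_polls; i++) {
        if (read_flag(ctx) == done_value)
            return i + 1;
    }
    return -1; /* gave up: the device never reported completion */
}

/* Fake device for illustration only: reports "done" (1) once
 * reads_until_done reads have been made, and 0 before that. */
typedef struct {
    int reads_until_done;
} fake_device;

uint32_t fake_read(void *ctx)
{
    fake_device *dev = ctx;
    if (dev->reads_until_done > 0)
        dev->reads_until_done--;
    return dev->reads_until_done == 0 ? 1u : 0u;
}
```

The timeout is a design choice of this sketch, so that a host loop cannot spin forever if the device never sets the flag. An interrupt-based scheme would replace the loop with a blocking wait, but as noted above, the slow read of the results would remain.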

I am using DMA in the code (on the device side; I've read about host DMA in the FPGA but didn't try that at all). For now I am just using it because it is faster than fetching the data one element at a time; I am not using it concurrently with the rest of the device code. That could be done, but for now the device code that is not transferring data takes too little time to make it relevant. I mean, I could parallelize the data transfer in the device with the actual multiplication, but for now the multiplication takes too little time to gain much, and to make it worse the host data movement is still a limitation. If other improvements were made first, that could change.
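The idea of overlapping the transfer with the computation is usually structured as double buffering. The sketch below shows only that loop structure, under loud assumptions: `start_transfer` and `wait_transfer` are hypothetical stand-ins that copy synchronously with `memcpy`, so nothing actually runs concurrently here, and the sum of squares is a placeholder for the real multiplication. On real hardware the start call would launch a DMA and return immediately.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 4 /* elements per local buffer (arbitrary for this sketch) */

/* Hypothetical stand-in for launching an asynchronous DMA transfer.
 * Here it simply copies synchronously. */
static void start_transfer(float *dst, const float *src, size_t n)
{
    memcpy(dst, src, n * sizeof *dst);
}

/* Hypothetical stand-in for waiting on the pending transfer.
 * Nothing to wait for, because the copy above is synchronous. */
static void wait_transfer(void) {}

/* Number of elements in chunk c of a total-element array. */
static size_t chunk_len(size_t c, size_t total)
{
    size_t remaining = total - c * CHUNK;
    return remaining < CHUNK ? remaining : CHUNK;
}

/* Sum of squares over `total` floats, fetched chunk by chunk into
 * two local buffers that alternate between "being filled" and
 * "being computed on". */
float sum_squares_double_buffered(const float *remote, size_t total)
{
    float buf[2][CHUNK];
    size_t nchunks = (total + CHUNK - 1) / CHUNK;
    float acc = 0.0f;

    assert(remote != NULL || total == 0);
    if (nchunks == 0)
        return acc;

    /* Prefetch the first chunk before entering the loop. */
    start_transfer(buf[0], remote, chunk_len(0, total));

    for (size_t c = 0; c < nchunks; c++) {
        const float *cur = buf[c % 2];
        size_t len = chunk_len(c, total);

        wait_transfer(); /* chunk c is now available in cur */

        /* Request the next chunk into the other buffer, then compute. */
        if (c + 1 < nchunks)
            start_transfer(buf[(c + 1) % 2], remote + (c + 1) * CHUNK,
                           chunk_len(c + 1, total));

        for (size_t i = 0; i < len; i++)
            acc += cur[i] * cur[i];
    }
    return acc;
}
```

With a truly asynchronous transfer, the gain is bounded by the shorter of the two phases, which matches the point above: if the computation takes very little time compared with the data movement, overlapping them saves little.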

by **peteasa** » Tue Aug 23, 2016 11:12 am

Hi Miguel,

The new oh hardware includes DMA from the host to the Epiphany and also improves the memory access speeds. Plus, with the mailbox interrupt back to the host, I would expect improvements. It would be great if you could make the code available so that it could be worked on by others in the community and/or perhaps added to the PAL libraries!

Peter.

by **MiguelTasende** » Tue Aug 23, 2016 8:09 pm

Hi Peter,

Great! (I haven't closely followed the latest developments, but if there is host-device DMA, that may be great.)
I hope to have news about the code release in the next few days...
I can't wait to see what you and others can do with it.

Miguel

by **MiguelTasende** » Wed Aug 24, 2016 4:26 pm

Good news.
The code is released under the Mozilla Public License 2.0.
It is not perfectly "polished" (there may be some unused files included, and you'll find many comments and variable names in Spanish, among other things), but it works (at least here... hope to hear about other people testing it).

The link to GitHub is here:

https://github.com/mtasende/BLAS_for_Parallella

IMPORTANT NOTE: Most of the use cases require running 2 processes on the Linux host. That is explained in the README.txt, but it differs from how a typical program is run, and it is easy to forget.
