Parallella Community

by **tnt** » Fri Jan 25, 2013 3:40 pm

Reading from a neighbor core is a massively slow operation compared to a simple instruction. I mean, it's like tens of CPU cycles of pipeline stall ...

by **piotr5** » Fri Jan 25, 2013 8:31 pm

I'm aware of that, though the manual isn't specific about it (10+ cycles for an external read). The point is that no actual stall happens during the read when it's done right: you just need to keep the core busy with other work for those 10+ cycles. As I said, preparing for the actual division loop takes around 50 cycles, which is plenty of time for an external lookup. The goal is to reduce the 50-170 cycles needed for divsi3 without loss of accuracy.
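For readers who want something concrete to point at, here is a minimal sketch of a shift-and-subtract unsigned division with a separate preparation phase and division loop. It is an illustration only: it is not the libgcc divsi3 source and not the radix-256 idea, the name `udiv32_sketch` is hypothetical, and no cycle counts are claimed for it.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper for illustration; NOT the libgcc implementation.
 * Returns n / d and, if rem is non-NULL, stores n % d there.
 * For d == 0 it returns UINT32_MAX and leaves the dividend as remainder
 * (an arbitrary choice made for this sketch). */
uint32_t udiv32_sketch(uint32_t n, uint32_t d, uint32_t *rem)
{
    uint32_t q = 0;    /* quotient being built */
    uint32_t bit = 1;  /* quotient bit that matches the current divisor shift */

    if (d == 0) {
        if (rem != NULL) *rem = n;
        return UINT32_MAX;
    }

    /* Preparation: shift the divisor left until it reaches the dividend
     * or its top bit is set, tracking the matching quotient bit. */
    while (d < n && (d & 0x80000000u) == 0) {
        d <<= 1;
        bit <<= 1;
    }

    /* Division loop: produce one quotient bit per iteration. */
    while (bit != 0) {
        if (n >= d) {
            n -= d;
            q |= bit;
        }
        d >>= 1;
        bit >>= 1;
    }

    if (rem != NULL) *rem = n;
    return q;
}
```

Calling `udiv32_sketch(100, 7, &r)` returns 14 and sets `r` to 2. The preparation loop is the part that would have to overlap with any external lookup for the latency-hiding idea to pay off.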

btw, I have an idea for a radix256 algorithm without look-up: shift the numbers till you only need to divide by 1. look-up would be interesting for other stuff though...

by **aolofsson** » Fri Jan 25, 2013 9:27 pm

Sorry to disappoint you, but the core stalls until the data comes back in the case of a load transaction. It would have been very expensive to allow the core to continue operating while there are external loads pending.

by **piotr5** » Sat Jan 26, 2013 1:23 am

You emphasize "load transaction": is that because an external core storing data won't block the receiving core? And what about DMA transfers: in the mode where an interrupt is requested, does the core go to sleep until the transfer is complete?

I also found in the manual that moving data from a core to its neighbour takes 1.5 cycles of latency plus 1 cycle for the cMesh transaction. Is that correct? (The statement "Transactions move through the network, with a latency of 1.5 clock cycles per routing hop." in 5.1 contradicts the example given there: "A transaction traversing from the left edge to right edge of a 64-core chip would thus take 12 clock cycles." Or have I misunderstood something by associating "latency" with the delay caused by the actual data transfer, as opposed to the delay added later on by the protocol?) Why, then, does another chapter give a cycle count of 10+? Does that figure refer to the rMesh transaction time of 8 plus these 1.5 cycles of latency?

To clarify: I suspect that for the rMesh (i.e. a read operation) you need 8 cycles plus 3 cycles per hop, while for the cMesh network (i.e. a write) you need 1 cycle plus 1.5 cycles per hop. (xMesh is for off-chip traffic, probably including every memory location outside the 32 KB stored in each core. Read and write make no difference there; both are slow, allegedly about 60 cycles.)

That's why I think one can program around the blocking slowness and do a non-blocking table lookup in about 10-20 cycles. The only question is how much time is lost on the other core doing the write...
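The guessed numbers above can be written down as a small C sketch. Everything here is an assumption taken from this post rather than a confirmed figure: the constants (1 + 1.5 per hop for writes, 8 + 3 per hop for reads) are the poster's suspicion, the function names are hypothetical, and the hop count assumes a plain 2D mesh where the distance between two cores is the sum of the row and column differences.

```c
#include <stdlib.h>

/* Assumed hop count between two cores on a 2D mesh:
 * row distance plus column distance. */
int mesh_hops(int row_a, int col_a, int row_b, int col_b)
{
    return abs(row_a - row_b) + abs(col_a - col_b);
}

/* Guessed cMesh (write) cost: 1 cycle plus 1.5 cycles per hop. */
double write_cycles_guess(int hops)
{
    return 1.0 + 1.5 * hops;
}

/* Guessed rMesh (read) cost: 8 cycles plus 3 cycles per hop
 * (request out and data back). */
double read_cycles_guess(int hops)
{
    return 8.0 + 3.0 * hops;
}
```

Under these guesses a direct neighbour (one hop) costs 2.5 cycles to write and 11 cycles to read, which would be consistent with the "10+" figure. On the apparent contradiction: crossing a row of 8 cores is 7 hops under this count, so 1.5 cycles per hop gives 10.5 cycles, whereas 12 cycles equals 8 × 1.5. One possible reading is that the manual counts eight routers rather than seven hops, but the thread does not settle this.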

by **tnt** » Sat Jan 26, 2013 8:08 am

Well IMHO:
- It's not acceptable to use DMA from a 'gcc library' routine like idiv ... the user might be using the DMA for something really useful.
- You can't have the other core executing code to write the results ... again, that would imply that a library function on one core can trigger code on another core? That's just asking for trouble ...

by **mrgs** » Mon Jan 28, 2013 10:33 am

Well, IMHO: we have to divide (copy?!) this 'issue' into two parts. On one hand, everybody likes 'a nice math library'; on the other hand, nobody(?!) likes to suffer from HW 'weaknesses'. From my point of view, I focus on the advantages of HW + SW and try to solve the 'problematic' points. So (#1) we need 'standard' math libraries, and (#2) we need a mechanism that delivers data and 'code' to the cores 'just in time'. I think both are possible and independent of each other. Additionally, there is no other way. I mean, WE have to figure these out and solve them! Am I right? (!) FIXME

Regards, Gabor
