by amylaar » Sun Jan 20, 2013 3:10 pm
What algorithm fits an architecture depends on things like instruction set, available registers, and the costs of
code / data storage and access.
For sh4, I've used floating point division.
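The idea can be sketched in portable C (this is just my illustration, not the actual sh4 libgcc code): every 32-bit operand is exactly representable in a double, and one can check that truncating the rounded double quotient always yields the exact integer quotient.

```c
#include <stdint.h>

/* Illustrative sketch of FPU-based integer division (not the sh4
   libgcc routine).  A uint32_t fits exactly in a double's 53-bit
   mantissa, and the rounding error of the double divide is smaller
   than the distance from the true quotient to the next integer,
   so truncation gives the exact result.  d must be nonzero. */
uint32_t udiv_via_fpu(uint32_t n, uint32_t d)
{
    return (uint32_t)((double)n / (double)d);
}
```

On a target with a fast double-precision divider, this replaces a long integer-divide loop with a couple of conversions and one FDIV.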
For the SH64, I've used inline code that computes an approximate inverse and a correction factor to multiply the dividend by (there's sort of a Newton-Raphson step in there, but it's been arithmetically transformed for scheduling purposes). That is only a sensible space/time trade-off because the SH64 has a 32*32->64 bit multiply, plus 64-bit adds and shifts.
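The Newton-Raphson idea behind it can be sketched like this (my sketch, not the actual SH64 sequence): refine an approximation x of 2^32/d, then multiply the dividend by it and fix up the estimate.

```c
#include <stdint.h>

/* Sketch of reciprocal-based division (not the SH64 libgcc code).
   Assumes a cheap wide multiply; uses GCC/Clang's unsigned __int128
   for the one intermediate that needs more than 64 bits.
   x approximates 2^32/d, and a Newton-Raphson step
       x' = x * (2^33 - d*x) / 2^32
   squares the relative error.  A real implementation seeds x well
   enough (e.g. from a table) that two or three steps suffice; this
   crude power-of-two seed needs more iterations.  d must be nonzero. */
uint32_t udiv_nr(uint32_t n, uint32_t d)
{
    int k = 31 - __builtin_clz(d);       /* 2^k <= d < 2^(k+1) */
    uint64_t x = 1ull << (32 - k);       /* first guess at 2^32/d */

    for (int i = 0; i < 40; i++) {
        uint64_t dx = (uint64_t)d * x;   /* stays below 2^33 */
        x = (uint64_t)(((unsigned __int128)x * ((1ull << 33) - dx)) >> 32);
    }
    /* After the first step x never overestimates 2^32/d, so the
       quotient estimate never overshoots; correct it upward. */
    uint64_t q = ((uint64_t)n * x) >> 32;
    while ((q + 1) * d <= n)
        q++;
    return (uint32_t)q;
}
```

The final fix-up loop is what makes the transformed inline sequence safe: the reciprocal only has to be close, not exact.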
For sh4-nofpu and ARCompact ARC700, I've used lookup tables.
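To illustrate the table approach (again my sketch, not the actual sh4-nofpu/ARC700 tables): store approximate reciprocals, multiply, and correct. Here the table covers byte-sized divisors and is filled at startup just to keep the example self-contained; a library table would be precomputed constants.

```c
#include <stdint.h>

/* Illustrative table-driven division for divisors 1..255 (not the
   sh4-nofpu or ARC700 code).  inv[d] = floor(2^32/d); the product
   estimate (n * inv[d]) >> 32 is at most 1 below the true quotient,
   so a single conditional correction makes it exact. */
static uint64_t inv[256];

static void init_inv(void)
{
    for (uint32_t d = 1; d < 256; d++)
        inv[d] = (1ull << 32) / d;
}

uint32_t udiv_tab(uint32_t n, uint32_t d)   /* 1 <= d <= 255 */
{
    uint64_t q = ((uint64_t)n * inv[d]) >> 32;
    if ((q + 1) * d <= n)                    /* estimate at most 1 low */
        q++;
    return (uint32_t)q;
}
```

The trade-off is obvious: a few hundred bytes of table buy a division that costs one multiply, one shift, and one compare.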
For Epiphany, space is at a premium, so big inline code, tables, or complex algorithms are out.
Yes, you could distribute the code across multiple cores, but there is also a need for a basic implementation that runs on a single core and leaves the other cores free for other work.
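For reference, the classic minimal single-core fallback is the shift-and-subtract loop, which needs no tables and no wide multiply (a sketch, not the actual Epiphany libgcc routine):

```c
#include <stdint.h>

/* Smallest reasonable implementation: restoring division, one
   quotient bit per iteration.  Illustrative C, not the Epiphany
   libgcc code.  d must be nonzero. */
uint32_t udiv_small(uint32_t n, uint32_t d)
{
    uint32_t q = 0, r = 0;
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1);  /* shift in next dividend bit */
        if (r >= d) {                   /* divisor fits: quotient bit 1 */
            r -= d;
            q |= 1u << i;
        }
    }
    return q;
}
```

It is slow (32 iterations), but it compiles to a handful of instructions, which is what matters when code space is the scarce resource.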
For a specific use case and Parallella grid setup, you might find a better optimization by using a different algorithm and a specific link strategy.