Parallella Community

by **dar** » Mon Sep 15, 2014 10:29 pm

Yes, that's a bug. Keep in mind hello_stdcl gets more attention from the developer than hello_opencl.

It's implied that none of the "hello" programs are efficient; they are tutorial in nature. If you really wanted to make this example as efficient as possible, you would start by changing the host program to use a global index of 16 and fold the remaining parallelism into the kernel. Next you would do something along the lines of what you suggest. You would have, say, thread 0 read in a chunk of b, write a copy to thread 1, and then use that chunk. Each thread in the chain would then copy its chunk of b forward before doing its sub-calculation with it. In this way b is read once and recycled, which I believe is what you are suggesting. There is a tiling algorithm to orchestrate here.
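To make the chain idea concrete, here is a rough sketch in plain C that simulates it sequentially on the host. It is not the actual kernel: it assumes the example computes a matrix-vector product c = A*b, and the names `matvec_chain`, `NCORES`, `CHUNK` and `local_b` are made up for illustration. Each "thread" owns a block of rows, thread 0 is the only one that reads b from global memory, and every chunk is handed one hop down the chain per step.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NCORES 16  /* assumed: one thread per core on a 16-core chip */
#define CHUNK  4   /* hypothetical number of b elements per chunk */

/* Sequential simulation of the chained scheme for c = A*b, where A is
 * n x n in row-major order. Thread t owns rows [t*rows, (t+1)*rows).
 * Returns how many chunks of b were read from "global" memory. */
int matvec_chain(size_t n, const float *a, const float *b, float *c)
{
    float local_b[NCORES][CHUNK] = {{0}};  /* stand-in for per-core local memory */
    size_t rows, nchunks, step, r, j;
    int t, global_reads = 0;

    assert(n % NCORES == 0 && n % CHUNK == 0);
    rows = n / NCORES;
    nchunks = n / CHUNK;

    for (r = 0; r < n; r++)
        c[r] = 0.0f;

    /* The last chunk needs NCORES-1 extra steps to reach the last thread. */
    for (step = 0; step < nchunks + NCORES - 1; step++) {
        /* Hand each chunk one hop down the chain, last thread first
         * so that no chunk is overwritten before it has been copied. */
        for (t = NCORES - 1; t > 0; t--)
            memcpy(local_b[t], local_b[t - 1], sizeof local_b[t]);

        /* Only thread 0 ever touches global b. */
        if (step < nchunks) {
            memcpy(local_b[0], b + step * CHUNK, sizeof local_b[0]);
            global_reads++;
        }

        /* At this step thread t holds chunk (step - t), if that index is valid. */
        for (t = 0; t < NCORES; t++) {
            size_t k;
            if (step < (size_t)t || step - (size_t)t >= nchunks)
                continue;
            k = step - (size_t)t;
            for (r = (size_t)t * rows; r < ((size_t)t + 1) * rows; r++)
                for (j = 0; j < CHUNK; j++)
                    c[r] += a[r * n + k * CHUNK + j] * local_b[t][j];
        }
    }
    return global_reads;
}
```

With n = 1024 this reads b from global memory in 1024/CHUNK chunks exactly once, instead of once per thread. On real hardware the hops would be core-to-core writes with some form of synchronization between steps, which this sketch leaves out.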

by **djm** » Mon Sep 15, 2014 11:22 pm

Thanks again for the info.

I see. IIUC you are saying it's better to launch the kernel once on each core and then have each kernel loop (in this case 64 times, i.e. 1024/16). Would that be more efficient?
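As a rough sketch of what that folding could look like, written as plain C rather than real OpenCL: the function names and the matrix-vector form c = A*b are assumptions for illustration, and the `for` loop in `launch_folded` only stands in for the runtime launching 16 work-items.

```c
#include <assert.h>
#include <stddef.h>

#define NCORES 16  /* assumed global size: one work-item per core */

/* Body of one work-item when the global size equals the core count.
 * Instead of computing a single row, work-item gid loops over its own
 * block of n/NCORES rows (64 rows when n = 1024). */
void matvec_folded_item(size_t gid, size_t n,
                        const float *a, const float *b, float *c)
{
    size_t rows = n / NCORES;
    size_t r, j;

    for (r = gid * rows; r < (gid + 1) * rows; r++) {
        float sum = 0.0f;
        for (j = 0; j < n; j++)
            sum += a[r * n + j] * b[j];
        c[r] = sum;
    }
}

/* Host-side stand-in for a single launch with a global size of NCORES. */
void launch_folded(size_t n, const float *a, const float *b, float *c)
{
    size_t gid;

    assert(n % NCORES == 0);
    for (gid = 0; gid < NCORES; gid++)
        matvec_folded_item(gid, n, a, b, c);
}
```

The arithmetic is the same as with 1024 work-items; what changes is that each core is launched once and the per-row iteration moves inside the kernel.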

I have been playing with the OpenCL API since I have bindings available from the language I'm using. I may switch over to stdcl at some point.

Thanks,
djm
