Yes, that's a bug. Keep in mind hello_stdcl gets more attention from the developer than hello_opencl.
It's implied that none of the "hello" programs are efficient; they are tutorial in nature. If you really wanted to make this example as efficient as possible, you would start by changing the host program to use a global index of 16 and fold the remaining parallelism into the kernel. Next you would do something along the lines of what you suggest. You might, for example, have thread 0 read in a chunk of b, write a copy to thread 1, and then use that chunk. Each thread in turn would then, like a chain, copy the chunk of b forward and do its sub-calculation with it. In this way b is read once and recycled. This is what you are suggesting, I believe. There is a tiling algorithm to orchestrate here.
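
To make the chain idea concrete, here is a rough serial model in plain C. It is not kernel code and not what the hello programs do. It assumes the example computes a matrix-vector product c = A*b, that each of the 16 threads owns a contiguous block of rows, and that b is split into small chunks. The names matvec_chain, CHUNK, ROWS_PER_THREAD and the reads_of_b counter are all made up for illustration.

```c
#include <stddef.h>
#include <string.h>

#define NTHREADS        16  /* matches the global index of 16 suggested above */
#define ROWS_PER_THREAD 4   /* hypothetical: rows of A owned by each thread */
#define CHUNK           4   /* hypothetical: elements of b passed per hop */
#define N       (NTHREADS * ROWS_PER_THREAD)
#define NCHUNKS (N / CHUNK)

/*
 * Serial model of the chain. Only thread 0 ever reads b from "global"
 * memory; every other thread receives its chunk from its predecessor.
 * a is N x N in row-major order, b and c have N elements.
 * reads_of_b counts how many elements of b were fetched from global memory.
 */
void matvec_chain(const float *a, const float *b, float *c, size_t *reads_of_b)
{
    float local[NTHREADS][CHUNK] = {{0}}; /* per-thread local copy of a chunk */
    int held[NTHREADS];                   /* chunk index held, -1 if none */

    for (int t = 0; t < NTHREADS; t++) held[t] = -1;
    for (int i = 0; i < N; i++) c[i] = 0.0f;
    *reads_of_b = 0;

    /* The last chunk needs NTHREADS - 1 extra steps to reach the last thread. */
    for (int step = 0; step < NCHUNKS + NTHREADS - 1; step++) {
        /* Pass chunks forward, last thread first so nothing is overwritten
           before it has been handed on. */
        for (int t = NTHREADS - 1; t > 0; t--) {
            memcpy(local[t], local[t - 1], sizeof local[t]);
            held[t] = held[t - 1];
        }

        /* Thread 0 reads the next chunk of b from global memory, if any. */
        if (step < NCHUNKS) {
            memcpy(local[0], b + (size_t)step * CHUNK, sizeof local[0]);
            held[0] = step;
            *reads_of_b += CHUNK;
        } else {
            held[0] = -1;
        }

        /* Each thread that holds a chunk does its sub-calculation:
           partial dot products of its own rows with that chunk. */
        for (int t = 0; t < NTHREADS; t++) {
            if (held[t] < 0) continue;
            int col0 = held[t] * CHUNK;
            for (int r = 0; r < ROWS_PER_THREAD; r++) {
                int row = t * ROWS_PER_THREAD + r;
                for (int k = 0; k < CHUNK; k++)
                    c[row] += a[(size_t)row * N + col0 + k] * local[t][k];
            }
        }
    }
}
```

After the call, reads_of_b equals N: every element of b was fetched from global memory exactly once, by thread 0, and everything else was thread-to-thread copies. In a real kernel the forwarding and the sub-calculation would overlap and need synchronization between neighbours, which this serial model leaves out entirely.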