Sorry, you're right. Each e-core has 32KB of local memory (not 8KB).
Thus for dot product, we could theoretically place two 16KB arrays on each e-core. With 4 bytes per integer, that is 4,096 elements per array on each core. Over 16 cores, we could support arrays up to 65,536 in length without requiring a fetch to main memory. The DMA cost I was talking about was the DMA channel between the ARM and the Epiphany chip, not the elinks between cores. If we want to exceed array sizes of 65,536, we would have to write our program so that it uses the DMA channel to fetch new portions of the array from the 2GB main memory bank on the ARM chip. As the array gets large, I imagine this performance cost will get prohibitively high.
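To make the capacity arithmetic explicit, here is a small sketch in C. The function names are made up for illustration, and it assumes the whole 32KB is available for data, which is the theoretical best case (in practice the program code and stack have to live in the same local memory, so the real number would be lower).

```c
#include <stddef.h>

/* Hypothetical helper: how many elements fit in each array on one core,
 * assuming all of the local memory is split evenly between the arrays. */
size_t elems_per_core(size_t local_mem_bytes, size_t num_arrays, size_t elem_bytes)
{
    size_t bytes_per_array = local_mem_bytes / num_arrays;  /* 32KB / 2 = 16KB */
    return bytes_per_array / elem_bytes;                    /* 16KB / 4 = 4096 */
}

/* Hypothetical helper: largest array length that stays entirely on-chip
 * when the array is divided evenly over all cores. */
size_t max_on_chip_length(size_t local_mem_bytes, size_t num_arrays,
                          size_t elem_bytes, size_t num_cores)
{
    return elems_per_core(local_mem_bytes, num_arrays, elem_bytes) * num_cores;
}
```

Calling `max_on_chip_length(32 * 1024, 2, 4, 16)` gives the 65,536 figure above.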
Back to the issue with array sizes greater than 4096. The sum of products (SOP) for two integer arrays that each hold the values i = 0 ... n-1, with n = 4096, is 22,898,104,320. This exceeds the capacity of a 4-byte int or long, which tops out at 4,294,967,295 even when unsigned. There is a long long type that is 8 bytes, which should hold this value. However, I ran into trouble when I tried to change the type of the sop variable to be unsigned long long. That's why I left it as an open problem.
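For reference, here is a minimal host-side sketch of the overflow in plain C. It is not the Epiphany code from the project, and the function names are hypothetical. It assumes the array values are non-negative, as in the i = 0 ... n-1 test. It uses the fixed-width types from `<stdint.h>` so the sizes are the same on every platform, and it widens each operand before the multiply so the product is also computed in 64 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit accumulator: the running sum wraps around modulo 2^32. */
uint32_t sop_u32(const int *a, const int *b, size_t n)
{
    uint32_t sop = 0;
    for (size_t i = 0; i < n; i++) {
        sop += (uint32_t)a[i] * (uint32_t)b[i];
    }
    return sop;
}

/* 64-bit accumulator: operands are widened first, so neither the
 * individual products nor the running sum are truncated to 32 bits. */
uint64_t sop_u64(const int *a, const int *b, size_t n)
{
    uint64_t sop = 0;
    for (size_t i = 0; i < n; i++) {
        sop += (uint64_t)a[i] * (uint64_t)b[i];
    }
    return sop;
}
```

With both arrays filled with 0 ... 4095, `sop_u64` returns 22,898,104,320 while `sop_u32` returns only the low 32 bits of that value. One thing worth checking when switching types is the print format: an `unsigned long long` needs `%llu` in `printf`, and a leftover `%d` or `%u` will print the wrong number even when the sum itself is correct. I can't say whether that is what went wrong here.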
-Suzanne