Interesting feature indeed. I'll definitely try using it at work when benchmarking. Thanks for sharing.
As for the performance of the sample core: it was not designed for high throughput, and I do not expect it to be faster than a CPU-based implementation of the same computation. The sample core uses the AXI-Lite memory-mapped interface. It is the easiest one to get going, and there are ready-made examples online from which I borrowed heavily, which is why it was chosen. This interface is good for low-volume data transfers: setting parameters and controlling low-bandwidth peripherals. For a typical accelerator design you would want to use AXI-Stream. Your custom block would have one AXI-Stream input (slave), one AXI-Stream output (master), and maybe another AXI port for run-time configuration. The block consumes data from the input stream, transforms it in some way, and writes the result to the output stream. You then connect your block to DDR memory using AXI-DMA as described here: . Just replace the FIFO with your custom block.
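To make the stream-in/stream-out idea concrete, here is a rough software model of such a block. It is only a behavioral sketch in Python, not HDL and not the sample core: the names `Beat`, `stream_block` and `packet` are made up for illustration, and a 32-bit TDATA width is assumed. Each accepted input beat (TVALID and TREADY both high) yields one output beat, and TLAST is passed through so packet boundaries survive.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass
class Beat:
    """One AXI-Stream transfer (hypothetical model type)."""
    tdata: int   # payload word, assumed 32 bits wide
    tlast: bool  # True on the final beat of a packet


def stream_block(s_axis: Iterable[Beat],
                 transform: Callable[[int], int]) -> Iterator[Beat]:
    """Consume beats on the slave side, emit transformed beats on the master side."""
    for beat in s_axis:
        # One input beat in, one output beat out; TLAST is forwarded unchanged.
        yield Beat(transform(beat.tdata) & 0xFFFFFFFF, beat.tlast)


def packet(words: Iterable[int]) -> List[Beat]:
    """Wrap a list of words into one packet, asserting TLAST on the last word."""
    words = list(words)
    return [Beat(w, i == len(words) - 1) for i, w in enumerate(words)]


# Example: a block that doubles every word of a three-word packet.
out = list(stream_block(packet([1, 2, 3]), lambda w: w * 2))
```

In hardware the `transform` step would be your own logic, and the handshake and any back-pressure handling would have to be written explicitly; the model hides all of that.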
I followed the example linked above on the Parallella and tested it under Linux, using a UIO driver to control the DMA block; this worked fine. I have yet to implement a custom block with a stream-in/stream-out interface. Vivado can generate a template for that kind of block, but obviously writing all the logic is up to you, and it's not yet clear to me how much understanding of AXI-Stream is required to write one. Another option is to use Vivado HLS; I think the trial license is 30 days.
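For reference, controlling the DMA block from user space through UIO looks roughly like the sketch below. It is not the exact code used in the test above. It assumes the Xilinx AXI DMA core in direct register mode (no scatter-gather), and the device path `/dev/uio0`, the 64 KiB map size, and the physical buffer addresses are all placeholders. The buffers must be physically contiguous memory that the DMA can reach; how you obtain them is outside this sketch.

```python
import mmap
import os

# Register offsets of the Xilinx AXI DMA in direct register mode.
MM2S_DMACR, MM2S_DMASR, MM2S_SA, MM2S_LENGTH = 0x00, 0x04, 0x18, 0x28
S2MM_DMACR, S2MM_DMASR, S2MM_DA, S2MM_LENGTH = 0x30, 0x34, 0x48, 0x58
DMACR_RS = 1 << 0    # run/stop bit in the control registers
DMASR_IDLE = 1 << 1  # idle bit in the status registers


def open_uio(path="/dev/uio0", size=0x10000):
    """Map the DMA registers; path and size are assumptions for illustration."""
    fd = os.open(path, os.O_RDWR | os.O_SYNC)
    mm = mmap.mmap(fd, size, mmap.MAP_SHARED, mmap.PROT_READ | mmap.PROT_WRITE)
    return memoryview(mm).cast("I")  # index registers as 32-bit words


def start_transfer(regs, src_phys, dst_phys, nbytes):
    """Start one memory -> stream -> memory transfer of nbytes."""
    regs[S2MM_DMACR // 4] = DMACR_RS   # start the receive channel first
    regs[MM2S_DMACR // 4] = DMACR_RS
    regs[S2MM_DA // 4] = dst_phys      # physical destination address
    regs[MM2S_SA // 4] = src_phys      # physical source address
    regs[S2MM_LENGTH // 4] = nbytes    # writing LENGTH kicks off the channel
    regs[MM2S_LENGTH // 4] = nbytes


def is_idle(regs, dmasr_offset):
    """True once the channel whose status register is given reports idle."""
    return bool(regs[dmasr_offset // 4] & DMASR_IDLE)


# Dry run against a plain bytearray instead of real hardware.
fake = memoryview(bytearray(0x60)).cast("I")
start_transfer(fake, 0x10000000, 0x11000000, 4096)
```

On real hardware you would call `open_uio()`, start the transfer, then poll `is_idle(regs, S2MM_DMASR)` or block on a read of the UIO device to wait for the interrupt. Whether Python's memoryview writes reach the device as single 32-bit stores is not guaranteed, so a C version with `volatile uint32_t *` is the safer choice for anything beyond experiments.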
I haven't yet found a written tutorial on creating an AXI-Stream custom block, so I'm slowly going through this video tutorial . I'm up to the middle of video 3.