I've got a ZedBoard (Zynq-7020 based) with a project for generating Mandelbrot fractals. It clocks at 300 MHz, performing two complex 36-bit fixed-point MACs per cycle, giving me 700M CMACs per second. (This uses only 24 of the Zynq's 220 DSP blocks.)
Once I get the frame buffer out of the Zynq's block RAM (which is harder than it sounds...), I'm hoping to get around 5,000M CMACs per second.
The first thing I want to do is see how the Parallella stacks up, not only in performance but also in code development time.
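For a software baseline to compare against, the per-pixel work (one complex multiply-accumulate, z = z*z + c, per iteration) could be sketched in C roughly as below. This is only an illustration, not the FPGA design: the Q4.32 split of the 36 bits is an assumption (the exact format isn't stated above), the names `fix_t`, `fix_mul`, `fix_from_double` and `mandel_iters` are made up for the example, and values are held in an `int64_t` without enforcing the 36-bit width. It also relies on the GCC/Clang `__int128` extension and on arithmetic right shift of negative values.

```c
#include <stdint.h>

/* Assumed layout: 4 integer bits (including sign) + 32 fractional bits. */
#define FRAC_BITS 32

typedef int64_t fix_t;  /* holds the fixed-point value; width is not clamped to 36 bits */

/* Convert a double to the assumed fixed-point format (truncates toward zero). */
static fix_t fix_from_double(double x)
{
    return (fix_t)(x * (double)(1LL << FRAC_BITS));
}

/* Fixed-point multiply: widen to 128 bits so the full product fits,
 * then shift back down to FRAC_BITS fractional bits. */
static fix_t fix_mul(fix_t a, fix_t b)
{
    return (fix_t)(((__int128)a * (__int128)b) >> FRAC_BITS);
}

/* Iterate z = z*z + c from z = 0 and return the number of iterations
 * completed before |z|^2 exceeds 4, or max_iter if it never does. */
int mandel_iters(fix_t cr, fix_t ci, int max_iter)
{
    const fix_t four = (fix_t)4 << FRAC_BITS;
    fix_t zr = 0, zi = 0;

    for (int i = 0; i < max_iter; i++) {
        fix_t rr = fix_mul(zr, zr);   /* re(z)^2 */
        fix_t ii = fix_mul(zi, zi);   /* im(z)^2 */
        if (rr + ii > four)
            return i;                 /* escaped */
        fix_t ri = fix_mul(zr, zi);   /* re(z) * im(z) */
        zi = 2 * ri + ci;             /* im(z*z + c) */
        zr = rr - ii + cr;            /* re(z*z + c) */
    }
    return max_iter;                  /* treated as inside the set */
}
```

Calling `mandel_iters(fix_from_double(x), fix_from_double(y), max_iter)` for each pixel's coordinate gives the iteration count to colour by. Each loop pass is one CMAC, so timing this loop gives a CMACs-per-second figure that can be compared directly with the FPGA number.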