I think I have at least one problem figured out. If I copy memcpy into my own code, it gives roughly the same benchmark values as my own code adapted to copy 8 bits at a time instead of 64. So I guess the low memcpy performance comes from the standard C library residing in external DRAM instead of SRAM.
However, I am still wondering why -O3, instead of optimizing my code for speed, makes it so much slower.
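For anyone following along, the 8-bit versus 64-bit comparison mentioned above can be sketched roughly like this. It is only an illustration, not my exact benchmark code: the names `copy_bytes` and `copy_words` are made up for this post, and the word-wise version assumes 8-byte-aligned buffers whose size is a multiple of 8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical byte-wise copy: one 8-bit load and store per iteration. */
static void copy_bytes(void *dst, const void *src, size_t n)
{
    uint8_t *d = dst;
    const uint8_t *s = src;
    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
}

/* Hypothetical word-wise copy: one 64-bit load and store per iteration.
 * Assumption: dst and src are 8-byte aligned and n (in bytes) is a
 * multiple of 8. */
static void copy_words(uint64_t *dst, const uint64_t *src, size_t n)
{
    assert(n % sizeof(uint64_t) == 0);
    for (size_t i = 0; i < n / sizeof(uint64_t); i++)
        dst[i] = src[i];
}
```

Both functions produce the same result; only the access width differs. One thing that might be worth checking in the disassembly, if the compiler is GCC: at higher optimization levels it can recognize plain copy loops like these and replace them with a call to memcpy (the `-ftree-loop-distribute-patterns` option), which would route the copy back through the library routine.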