Single precision Linpack benchmark code

Hello,
I have been using the HPL code from http://www.netlib.org/benchmark/hpl/ to run the Linpack benchmark.
I am using the distributed version, and it is very good. The situation is this:
I compile a custom BLAS, have my own MPI library, and link both of them to the HPL code. HPL runs perfectly with my custom BLAS and MPI on a distributed cluster. The only problem is that it is written for double precision (and my Epiphany-BLAS is very slow in double precision...).
Does anybody know of a similar implementation for single precision?
The basic feature I need is: "ability to compile with a custom BLAS"
A secondary but desired one is: "ability to run on distributed hardware / support for MPI"
P.S.: Even in single precision the BLAS still needs a lot of improvement, but I am at about 1.4 GFLOPS per Parallella node for A[128x7808] x B[7808x128] in single precision, with data transfer, pre-processing and post-processing included. I think I can improve on that by a factor of 3 or 4 with changes to the outer algorithm and by adding some assembler to the inner function. For now it is all C code, and I have clearly identified the function to optimize (luckily it is the one that does the scalar multiplications...). I also think I could get close to the theoretical peak with some improvement in the e-link FPGA speeds.
The "outer" algorithm used is not Cannon's. So far the performance is no better than results reported before, but I think this algorithm leaves room for further improvement on this platform, and that is its advantage over Cannon's (I am working on it).
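For anyone who wants to compare numbers with the figure in the P.S., here is a minimal sketch of the arithmetic behind a GEMM throughput figure. It assumes the usual convention of counting 2*m*n*k floating-point operations for C = A x B with A[m x k] and B[k x n]. The function names (gemm_flops, gemm_gflops, gemm_seconds) are only illustrative and are not part of HPL or any BLAS.

```c
/* Floating-point operations in C = A x B, with A[m x k] and B[k x n].
   Assumption: one multiply and one add per inner-product term (2*m*n*k). */
double gemm_flops(long m, long n, long k)
{
    return 2.0 * (double)m * (double)n * (double)k;
}

/* Sustained rate in GFLOPS for a measured wall-clock time in seconds.
   The time should include whatever is being reported (transfers, pre/post). */
double gemm_gflops(long m, long n, long k, double seconds)
{
    return gemm_flops(m, n, k) / seconds / 1e9;
}

/* Wall-clock time in seconds implied by a given sustained rate in GFLOPS. */
double gemm_seconds(long m, long n, long k, double gflops)
{
    return gemm_flops(m, n, k) / (gflops * 1e9);
}
```

With m = 128, n = 128, k = 7808 this gives 255,852,544 operations per multiplication, so 1.4 GFLOPS corresponds to roughly 0.18 seconds per call under that counting convention.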