The paper ``Extended-Precision Floating-Point Numbers for GPU Computation'' by Andrew Thall may be of use for the Epiphany.
It outlines the basic flops for a double-single format (each value is stored as the unevaluated sum of two 32-bit floats), which I presume should be faster (and smaller) than a software IEEE-double library. All operations are decomposed into 32-bit flops, so they can run directly on the hardware. It roughly doubles the mantissa precision but doesn't extend the exponent range; see the sketch below for what the core operations look like.
I downloaded it from here: http://andrewthall.net/papers/df64_qf128.pdf
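To make the idea concrete, here is a minimal sketch in C of the Dekker/Knuth error-free transformations that this kind of double-single ("df64") arithmetic is built on. The names (ds_t, two_sum, ds_add, ds_mul) are mine, not from the paper, and the multiply uses fmaf() for the low-order product error; the paper targets GPUs without FMA and uses Dekker-style splitting there instead. Treat this as an illustration of the technique, not the paper's code.

```c
/* Double-single ("df64") sketch: value = hi + lo, two 32-bit floats.
 * Assumes round-to-nearest float arithmetic; names are illustrative only. */
#include <math.h>
#include <stdio.h>

typedef struct { float hi, lo; } ds_t;   /* hi carries the value, lo the error */

/* Knuth's two-sum: exact sum of a and b returned as hi + lo (6 flops). */
static ds_t two_sum(float a, float b) {
    float s   = a + b;
    float v   = s - a;
    float err = (a - (s - v)) + (b - v);
    return (ds_t){ s, err };
}

/* Add two double-singles: roughly doubles the effective mantissa width. */
static ds_t ds_add(ds_t x, ds_t y) {
    ds_t s = two_sum(x.hi, y.hi);
    s.lo += x.lo + y.lo;
    return two_sum(s.hi, s.lo);          /* renormalize so |lo| stays small */
}

/* Multiply two double-singles; fmaf() recovers the rounding error of the
 * high product (without FMA this needs Dekker splitting instead). */
static ds_t ds_mul(ds_t x, ds_t y) {
    float p   = x.hi * y.hi;
    float err = fmaf(x.hi, y.hi, -p);    /* exact low part of x.hi * y.hi */
    err += x.hi * y.lo + x.lo * y.hi;    /* cross terms */
    return two_sum(p, err);
}

int main(void) {
    /* Build 1/3 as a double-single: ~48 significand bits instead of 24. */
    double d = 1.0 / 3.0;
    ds_t third = { (float)d, (float)(d - (double)(float)d) };
    ds_t two_thirds = ds_add(third, third);
    ds_t sq = ds_mul(two_thirds, two_thirds);
    printf("2/3     ~ %.17g\n", (double)two_thirds.hi + two_thirds.lo);
    printf("(2/3)^2 ~ %.17g\n", (double)sq.hi + sq.lo);
    return 0;
}
```

Each operation is a handful of single-precision flops, which is the whole appeal: no 64-bit hardware needed, and much cheaper than a full software IEEE-double emulation, at the cost of keeping only the float exponent range.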