My only experience with Transmeta CPUs was ... bad. As in "the CPU got used to [application A], and after exiting it, it took minutes to return to [application B]", or a 500 MHz processor feeling much slower than a 486/33. So I am definitely a bit burned there. Also, I have seen an AVR-based Z80 emulator, so I know that it is possible. A table-driven instruction translator can be very small indeed, but I assumed that a dynamic recompiler would do more than that.
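To make concrete what I mean by a table-driven translator, here is a minimal sketch in Python. The guest opcodes and host mnemonics are invented for illustration and do not correspond to any real instruction set or to the AVR emulator mentioned above.

```python
# Toy table-driven translator. Everything here is hypothetical:
# the guest opcodes and the host mnemonics are made up for illustration.
# Each entry maps a guest opcode to (number of operand bytes, host template).
TEMPLATES = {
    0x01: (1, "mov r0, #{0}"),      # guest LOAD imm8
    0x02: (1, "add r0, r0, #{0}"),  # guest ADD imm8
    0x03: (0, "ret"),               # guest RET
}


def translate(guest_code):
    """Translate guest bytes into a list of host instruction strings."""
    host = []
    pc = 0
    while pc < len(guest_code):
        opcode = guest_code[pc]
        if opcode not in TEMPLATES:
            raise ValueError("unknown guest opcode 0x%02x at offset %d" % (opcode, pc))
        n_operands, template = TEMPLATES[opcode]
        operands = guest_code[pc + 1 : pc + 1 + n_operands]
        if len(operands) != n_operands:
            raise ValueError("truncated instruction at offset %d" % pc)
        # Fill the operand bytes into the fixed host template.
        host.append(template.format(*operands))
        pc += 1 + n_operands
    return host


if __name__ == "__main__":
    for line in translate(bytes([0x01, 5, 0x02, 7, 0x03])):
        print(line)
```

Each guest instruction is translated in isolation here, with no reordering or scheduling of the output, which is why I expected a dynamic recompiler to need more than this.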
In any case, since I am obviously missing expertise here, do you have any useful references? The part "... as the cycles saved in doing this pale into insignificant against the cycles lost by poor scheduling" is especially interesting to me: how can a runtime work around stalls, and how does it detect whether an instruction will stall without actually executing it? Or does this happen at some higher level rather than at instruction granularity?