When it comes to designing and building high-performance, highly available systems, the basic concepts are not rocket science, yet it’s surprising how easy they are to get wrong. Below are some resources that might help you get it right the first time:

Designing for reliability:

Principles of Resilient Design

From Scott Jackson, one of the most concise papers I’ve found on designing resilience into systems.

It is part of a wider set of resources provided by the International Risk Governance Council (IRGC) here:

IRGC Resource Guide on Resilience

Although I’ve yet to acquire a copy, Scott also has a book, “Architecting Resilient Systems”, available on Amazon:

I understand an electronic version may exist. If I find it, I will update this post accordingly.

Robust Communications Software

I went to a lot of effort to acquire this book and wasn’t totally enamoured with it when I got it. But it has a lot of great, grassroots information on how to build robust systems, whether you’re in the telecommunications industry or not. It will probably make more sense if you’re already a software guy.