Designing, developing, deploying and managing applications that are truly production-ready certainly feels like an art to its practitioners, but doing it well requires method and, with it, rigor and discipline. Production-ready here means applications that meet the requirements imposed not only by the business owner but also by other important constituencies, such as operations, in a production setting, rather than applications that merely pass (usually inadequate) QA tests. The topic is even more important for those of us tasked with enabling an entire enterprise environment in which applications are built and deployed to meet very demanding requirements, are under severe scrutiny, or both.
Michael Nygard's "Release It!: Design and Deploy Production-Ready Software" (The Pragmatic Programmers, 2007) comprehensively discusses what it takes to make software production-ready. The book is organized around a number of important design and deployment patterns and anti-patterns, in areas such as Stability and Capacity, supplemented with best practices and (humorous, but instructive) real-world examples. Timeouts, Blocked Threads, Circuit Breaker, Fail Fast, Test Harness, resilience to transient impulses and others are certainly elements that we have learned the hard way to appreciate and to enforce in our applications and prescriptions. Threading, logging, efficient coding, networking, topology, integration, testing, monitoring and management are all given due attention. A special place is given to the operations management of applications and platforms, and to the transparency that will make or break an application once problems hit in production.
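To make two of the patterns named above concrete, here is a minimal sketch of a Circuit Breaker that fails fast while it is open. It is an illustration under stated assumptions, not code from the book: the names `CircuitBreaker` and `CircuitOpenError`, the parameters `failure_threshold` and `reset_timeout`, and their default values are all hypothetical. The sketch is not thread-safe, and it assumes the wrapped call enforces its own timeout.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Illustrative breaker; names and defaults are assumptions, not the book's."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of tying up a thread on a sick dependency.
                raise CircuitOpenError("circuit open; call rejected")
            # Reset timeout elapsed: half-open, so let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()       # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each call to a remote dependency, for example `breaker.call(fetch_inventory, sku)` where `fetch_inventory` is a hypothetical function, and treat `CircuitOpenError` as a signal to degrade gracefully rather than queue up more blocked threads.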
One of the interesting points Nygard makes is that the majority of developers are experienced in designing and building small to medium applications, not enterprise-class applications, which are defined by a very large production load within a well-defined set of environmental constraints (platform, tooling, contracts, separation of duties, strict SLAs and QoS). Scale is more than just adding servers in proportion to the difference in load between a small application and a large one. There is a multiplier effect that is non-linear, and an application that is not designed and deployed with it in mind will simply not work. Unfortunately, that type of skill comes from a lot of hands-on experience, good and bad, that is difficult to come by. Even when that experience is in place, human errors, omissions and undue external pressures can lead to bad situations.