Scaling services is a hard problem, but operating them at scale is even harder. You need to consider the operational aspects of running a service at scale at design time. Service downtime can cause huge losses for the business and lead to a poor customer experience.
Workloads can be unpredictable. In multi-tenant systems especially, each tenant's use case and the corresponding workload can be different. Tuning the service for one type of workload can cause severe problems for other workloads.
Modern services are designed as microservices. One microservice consuming excessive resources can cause problems for other microservices. Careful tuning and monitoring are required to operate the overall service. Sometimes one machine carries a higher load than the others because of uneven load distribution or configuration issues in DNS.
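One simple way to spot an unevenly loaded machine is to compare each host's load with the fleet average. The sketch below is only an illustration: the function name `find_hot_machines`, the host names, the load figures, and the 1.5x threshold are all assumptions, not values from any particular monitoring system.

```python
from statistics import mean


def find_hot_machines(load_by_host, ratio=1.5):
    """Return hosts whose load exceeds `ratio` times the fleet average.

    `load_by_host` maps a host name to a load figure (for example CPU
    utilization between 0 and 1). The 1.5x default is an arbitrary,
    illustrative threshold.
    """
    if not load_by_host:
        return []
    average = mean(load_by_host.values())
    if average == 0:
        return []
    return sorted(
        host for host, load in load_by_host.items() if load > ratio * average
    )


# Hypothetical sample data: one machine receives far more traffic.
loads = {"web-1": 0.42, "web-2": 0.38, "web-3": 0.95, "web-4": 0.41}
hot = find_hot_machines(loads)
print(hot)  # ['web-3']
```

A relative threshold like this adapts as the whole fleet gets busier, but it will not flag the case where every machine is overloaded equally, so in practice it would sit alongside absolute limits.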
Commodity hardware is increasingly used to deploy services. Component failures are common in these environments. Detecting failures and self-healing become critical to reducing downtime. Failures can also come from less frequent, unexpected events such as a datacenter power failure or network cabling issues.
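As a rough illustration of detection and self-healing, the sketch below probes a component and triggers a restart after several consecutive failed checks. Everything here is hypothetical: the `HealthMonitor` class, the `probe` and `restart` callables, and the threshold of three failures are assumptions chosen for the example, and a real system would run the checks on a timer and add backoff and alerting.

```python
class HealthMonitor:
    """Restart a component after `max_failures` consecutive failed probes."""

    def __init__(self, probe, restart, max_failures=3):
        self.probe = probe            # callable returning True when healthy
        self.restart = restart        # callable that restarts the component
        self.max_failures = max_failures
        self.failures = 0             # consecutive failed probes so far

    def check(self):
        """Run one probe and return 'healthy', 'unhealthy', or 'restarted'."""
        try:
            healthy = bool(self.probe())
        except Exception:
            # A probe that raises is treated the same as a failed probe.
            healthy = False
        if healthy:
            self.failures = 0
            return "healthy"
        self.failures += 1
        if self.failures >= self.max_failures:
            self.restart()
            self.failures = 0
            return "restarted"
        return "unhealthy"


# Simulated probe results: one success, three failures, then recovery.
results = iter([True, False, False, False, True])
restarts = []
monitor = HealthMonitor(
    probe=lambda: next(results),
    restart=lambda: restarts.append("restart"),
)
states = [monitor.check() for _ in range(5)]
print(states)
# ['healthy', 'unhealthy', 'unhealthy', 'restarted', 'healthy']
```

Requiring several consecutive failures before acting avoids restarting a component because of a single transient error, at the cost of slower detection.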
Faster response times are crucial for a business to reduce costs, gain a competitive advantage, and deliver a better user experience.
With the popularity of public clouds, customers and DevOps teams expect services that are easy to manage and to operate at scale.