Scaling a service is hard, but operating it at scale is even harder. You need to consider the operational aspects of running the service at scale at design time. Service downtime can cause large losses for the business and leads to a poor customer experience.
Operational Challenges
Workloads
Workloads can be unpredictable. In multi-tenant or multi-application systems especially, each tenant or application has its own use case, and the corresponding workloads can differ. Tuning the service for one type of workload can cause severe problems for other workloads.
Resources
Modern services are often designed as microservices. One microservice that consumes a lot of resources can cause problems for the others. Careful tuning and monitoring are required to operate the overall service efficiently. Sometimes one machine carries a much higher load than the others because load is distributed unevenly or because of DNS configuration issues.
Reliability
Commodity hardware is increasingly used to deploy services. Component failures are common in these environments, so detecting failures and self-healing become critical to reducing downtime. Failures can also come from less frequent, unexpected events such as a datacenter power failure or network cabling issues.
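One common way to detect failed components is to have each node send periodic heartbeats and to flag nodes that stay silent for too long. The following is a minimal sketch of that idea, not a production detector. The class name HeartbeatMonitor, the timeout value, and the injectable clock are assumptions made for illustration.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per node and reports nodes that went silent."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.last_seen = {}

    def record_heartbeat(self, node_id):
        # Called whenever a node reports in.
        self.last_seen[node_id] = self.clock()

    def failed_nodes(self):
        # A node is considered failed once it has been silent longer than the timeout.
        now = self.clock()
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

A self-healing loop could call failed_nodes() periodically and restart or replace whatever it returns. How aggressive the timeout should be depends on how costly a false positive is for the service.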
Security
Security should be the highest priority for any kind of service. You need to design a strong security posture from the beginning. This includes how you authenticate and authorize users, how you protect data privacy, and how you monitor for vulnerabilities. Threats are becoming more common, and immediate remediation and patching are needed when vulnerabilities are identified. Different countries have different privacy regulations, and these need to be considered at design time. You may also need auditing capabilities if you have to demonstrate compliance.
Performance
Fast response times are crucial for a business to reduce costs, gain a competitive advantage, and provide a better user experience. It is important to identify the bottlenecks ahead of time with back-of-the-envelope calculations. Simulations can help you understand the system at large scale and identify its theoretical limits.
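A back-of-the-envelope calculation can be as small as the sketch below. It estimates how many servers a peak load needs and, using Little's law (in-flight requests = arrival rate x mean latency), how many requests are being processed at once. The function names, the 25% headroom, and the traffic figures are illustrative assumptions, not measurements.

```python
import math


def required_servers(peak_rps, per_server_rps, headroom=0.25):
    """Servers needed for peak_rps if each server is kept below (1 - headroom) of its limit."""
    usable_rps = per_server_rps * (1 - headroom)
    return math.ceil(peak_rps / usable_rps)


def in_flight_requests(arrival_rps, mean_latency_seconds):
    """Little's law: average number of requests in the system at any moment."""
    return arrival_rps * mean_latency_seconds


# Assumed figures: 10,000 requests/s at peak, 500 requests/s per server, 250 ms mean latency.
print(required_servers(10_000, 500))     # 27
print(in_flight_requests(10_000, 0.25))  # 2500.0
```

If the in-flight estimate exceeds the number of worker threads or connections the deployment can hold, that resource is a likely bottleneck to examine before load testing.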
Capacity
Management
With the popularity of public clouds, customers and DevOps teams expect services that are easy to manage and operate at scale.
Upgrades
It is difficult to upgrade a service at scale, and release cycles tend to get longer the longer the service has been in operation. You need to spend more time on testing and ensuring the quality of each release so that an upgrade has as little impact as possible. Upgrade testing should be run with real-world service data and different deployment configurations.
Programmability
It is hard to release a patch for every small bug or improvement. If the service provides a well-defined management or administrative API, it can be used to work around issues until the next release.
Backward Compatibility
Between releases, API interfaces or data formats can change. Some functionality available in previous releases may be deprecated while users still depend on it. It is extremely important to consider backward compatibility during design.
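One way to keep old data readable after a format change is to store a version number with each record and upgrade old records when they are read. The sketch below assumes a hypothetical format in which version 1 stored a single "name" field and version 2 splits it into "first_name" and "last_name". The field names and the function upgrade_record are invented for illustration.

```python
def upgrade_record(record):
    """Return a copy of the record in the newest (version 2) layout."""
    # Records written before versioning existed are treated as version 1.
    version = record.get("version", 1)
    if version == 1:
        # Hypothetical v1 layout: a single "name" field holding "First Last".
        first, _, last = record["name"].partition(" ")
        return {"version": 2, "first_name": first, "last_name": last}
    if version == 2:
        return dict(record)
    # Refuse to guess at layouts this release does not know about.
    raise ValueError(f"unsupported record version: {version}")
```

With this approach the rest of the code only ever sees the newest layout, and support for an old version can be removed in one place once no stored data uses it.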
Hardware Replacement
Commodity hardware has a limited lifetime. Replacing hardware impacts service availability. If the service is deployed in a redundant configuration, the impact can be minimized.
Logging
Application logs are very important for investigating production issues. Each transaction or API call should log enough to debug issues.
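Debugging is easier when every log line of a transaction carries the same identifier, so that the lines can be grouped later. The sketch below does this with Python's standard logging module and a LoggerAdapter. The logger name "service", the request_id field, and the handle_request function are assumptions for illustration.

```python
import io
import logging


def build_logger(stream):
    """Create a logger whose format includes a per-request identifier."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s request_id=%(request_id)s %(message)s")
    )
    logger = logging.getLogger("service")  # hypothetical logger name
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger, request_id, payload):
    # LoggerAdapter attaches request_id to every record it emits.
    log = logging.LoggerAdapter(logger, {"request_id": request_id})
    log.info("received %d bytes", len(payload))
    result = payload.upper()  # stand-in for the real processing
    log.info("completed")
    return result


stream = io.StringIO()
handle_request(build_logger(stream), "req-42", b"ping")
print(stream.getvalue(), end="")
```

Logging once on entry and once on completion is usually a reasonable baseline. More detailed steps can be logged at a lower level that is switched on only while investigating.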
Statistics
You cannot log every step of the processing. Even if your logs are very detailed, they can be rolled over once predefined thresholds are reached. It is better to log statistics to a separate log file and also expose them through an API or command. Statistics can help in investigating complex problems such as performance or resource issues.
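A small in-process counter registry is often enough to start with. The sketch below keeps thread-safe counters, returns a snapshot that an API or command could serve, and formats the snapshot as a single line for a separate statistics log. The class name ServiceStats and the counter names are illustrative assumptions.

```python
import threading
from collections import Counter


class ServiceStats:
    """Thread-safe named counters that can be snapshotted at any time."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = Counter()

    def increment(self, name, amount=1):
        with self._lock:
            self._counters[name] += amount

    def snapshot(self):
        # Return a copy so callers cannot mutate the live counters.
        with self._lock:
            return dict(self._counters)

    def as_log_line(self):
        # One compact line per interval, suitable for a dedicated statistics log.
        return " ".join(f"{key}={value}" for key, value in sorted(self.snapshot().items()))


stats = ServiceStats()
stats.increment("requests_total")
stats.increment("requests_total")
stats.increment("errors_total")
print(stats.as_log_line())  # errors_total=1 requests_total=2
```

Because counters are cheap to update, they can cover every request even when detailed logging cannot.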
QoS
When multiple applications or users share the same platform, QoS prevents noisy-neighbor problems and ensures fair sharing.
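One widely used mechanism for this kind of fair sharing is a token bucket per tenant: each tenant earns tokens at a fixed rate and spends one per request, so a busy tenant can exhaust only its own budget. The sketch below is a minimal single-process version. The class names, the rate and burst parameters, and the injectable clock are assumptions for illustration.

```python
import time


class TokenBucket:
    """Allows short bursts up to `burst` and a sustained rate of `rate_per_second`."""

    def __init__(self, rate_per_second, burst, clock=time.monotonic):
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock
        self.updated = clock()

    def allow(self, cost=1.0):
        # Refill according to elapsed time, capped at the bucket capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantLimiter:
    """Keeps one bucket per tenant so tenants cannot consume each other's budget."""

    def __init__(self, rate_per_second, burst, clock=time.monotonic):
        self.rate_per_second = rate_per_second
        self.burst = burst
        self.clock = clock
        self.buckets = {}

    def allow(self, tenant_id):
        bucket = self.buckets.get(tenant_id)
        if bucket is None:
            bucket = TokenBucket(self.rate_per_second, self.burst, self.clock)
            self.buckets[tenant_id] = bucket
        return bucket.allow()
```

A request for which allow() returns False can be rejected or queued. In a multi-machine deployment the bucket state would have to live in a shared store, which this sketch does not cover.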
Tools
Tools provide various advantages to a service. They can provide access to some of the internal state and a way to operate on the collected information. Sometimes they can provide a workaround that gives immediate relief from an operational issue.