Designing Hyper Scalable Services

Scaling a service is a hard problem, but operating it at scale is even harder. You need to consider the operational aspects of running the service at scale at design time. Service downtime can cause large losses to the business and leads to a poor customer experience.

Operational Challenges

Workloads

Workloads can be unpredictable. In multi-tenant or multi-application systems especially, each tenant or application can have a different use case and a correspondingly different workload. Tuning the service for one type of workload can cause severe problems for other workloads.

Resources

Modern services are often designed as microservices. One microservice consuming excessive resources can cause problems for other microservices. Careful tuning and monitoring are required to operate the overall service efficiently. Sometimes one machine carries a much higher load than the others because of uneven load distribution or DNS configuration issues.

Reliability

Commodity hardware is increasingly used to deploy services. Component failures are common in these environments, so failure detection and self-healing become critical to reducing downtime. Failures can also come from less frequent, unexpected events such as a datacenter power failure or network cabling issues.
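
One common way to detect failed components is to have each node send periodic heartbeats and to flag nodes that go silent. The following is a minimal sketch of that idea, not a production detector; the class name, method names, and timeout value are assumptions made for illustration.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per node and reports nodes that went silent."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock        # injectable so the logic can be tested
        self.last_seen = {}       # node name -> time of last heartbeat

    def record_heartbeat(self, node):
        """Called whenever a heartbeat arrives from a node."""
        self.last_seen[node] = self.clock()

    def failed_nodes(self):
        """Return nodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

A self-healing loop could call failed_nodes() periodically and restart or replace whatever it returns. The timeout is a trade-off: too short causes false alarms, too long delays recovery.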

Security

Security should be the highest priority for any kind of service. You need to design a strong security posture from the beginning. This includes how you authenticate and authorize users, how you protect data privacy, and how you monitor for vulnerabilities. Threats are becoming more common, and immediate remediation and patching are needed when vulnerabilities are identified. Different countries have different privacy regulations, and these need to be considered at design time. You may also need auditing capabilities if the service has to meet compliance requirements.

Performance

Fast response times are crucial for a business to reduce costs, gain a competitive advantage, and provide a better user experience. It is important to identify bottlenecks ahead of time with back-of-the-envelope calculations. Simulations can help you understand the system at large scale and identify its theoretical limits.
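
A back-of-the-envelope calculation can be as simple as a few lines of arithmetic. The sketch below estimates server count and daily storage from assumed inputs; the function names, the 30% headroom default, and the example figures are illustrative assumptions, not measurements.

```python
import math


def servers_needed(peak_requests_per_second, per_server_rps, headroom=0.3):
    """Servers needed to serve the peak while keeping a fraction of capacity free."""
    usable_rps = per_server_rps * (1.0 - headroom)
    return math.ceil(peak_requests_per_second / usable_rps)


def daily_storage_gb(requests_per_second, bytes_per_request):
    """Raw data written per day, in gigabytes (10**9 bytes)."""
    seconds_per_day = 24 * 60 * 60
    return requests_per_second * bytes_per_request * seconds_per_day / 1e9
```

For example, a peak of 50,000 requests per second on servers that each handle 2,000 requests per second, with 30% headroom, works out to 36 servers. Numbers like these are rough, but they show early which resource is likely to become the bottleneck.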

Capacity

Expanding and shrinking capacity dynamically, as public clouds do, is critical for modern services to reduce operating costs.
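
As a rough illustration of dynamic expansion and shrinking, the sketch below uses a simple proportional rule: size the fleet so that average utilization moves back toward a target. The function name, the 60% target, and the instance bounds are assumptions for the example, not recommendations.

```python
def desired_instances(current_instances, avg_utilization_pct,
                      target_utilization_pct=60,
                      min_instances=2, max_instances=100):
    """Instance count that would bring average utilization near the target.

    Utilization values are whole percentages so the arithmetic stays in integers.
    """
    load = current_instances * avg_utilization_pct
    # Integer ceiling division: round up so the fleet is never undersized.
    raw = (load + target_utilization_pct - 1) // target_utilization_pct
    # Clamp to the configured bounds.
    return max(min_instances, min(max_instances, raw))
```

With 10 instances at 90% utilization this asks for 15 instances, and at 30% utilization it shrinks the fleet to 5. A real autoscaler would also need cooldown periods so it does not oscillate.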

Management

With the popularity of public clouds, customers and DevOps teams expect services that are easy to manage and operate at scale.

Upgrades

It is difficult to upgrade a service at scale, and release cycles tend to get longer the longer the service has been in operation. You need to spend more time testing and ensuring the quality of each release so that an upgrade has minimal impact. Upgrade testing should be run with real-world service data and different deployment configurations.

Programmability

It is hard to release a patch for every small bug or improvement. If the service provides a well-defined management or administrative API, it can be used to work around issues until the next release.

Backward Compatibility

Between releases, API interfaces or data formats can change. Functionality available in previous releases may be deprecated while users still depend on it. It is extremely important to consider backward compatibility during design.
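
One way to keep data formats backward compatible is to store a version number with every record and upgrade old records when they are read. The sketch below assumes a hypothetical JSON record whose version 1 stored a single "name" field and whose version 2 splits it in two; the field names and versions are invented for illustration.

```python
import json

CURRENT_VERSION = 2


def decode_record(raw):
    """Decode a JSON record written by any supported format version."""
    record = json.loads(raw)
    # Assumption: version 1 records were written before the field existed.
    version = record.get("version", 1)
    if version > CURRENT_VERSION:
        raise ValueError(f"unsupported record version {version}")
    if version == 1:
        # Hypothetical change: v1 stored one "name", v2 stores two fields.
        first, _, last = record.pop("name", "").partition(" ")
        record["first_name"] = first
        record["last_name"] = last
        record["version"] = 2
    return record
```

The rest of the code then only deals with the current format, and old data keeps working without a one-time migration.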

Hardware Replacement

Commodity hardware has a limited lifetime. Replacing hardware impacts service availability. If the service is deployed in a redundant configuration, the impact can be minimized.

Logging

Application logs are very important for investigating production issues. Each transaction or API call should log enough information to debug issues.

Statistics

You cannot log every step of processing. Even if your logs are very detailed, they can be rolled over when predefined thresholds are reached. It is better to log statistics to a separate log file and also expose them through an API or command. Statistics can help in investigating complex problems such as performance or resource issues.
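
A small in-process collector is often enough to get started with statistics. The sketch below keeps per-operation counts and average latency, and can write a snapshot as one JSON line to any stream, such as a dedicated statistics log file. The class and method names are assumptions for illustration.

```python
import json
import threading
from collections import defaultdict


class Stats:
    """Thread-safe per-operation counters and latency totals."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = defaultdict(int)
        self._latency_total_ms = defaultdict(float)

    def record(self, operation, latency_ms):
        """Count one completed operation and add its latency to the total."""
        with self._lock:
            self._counters[operation] += 1
            self._latency_total_ms[operation] += latency_ms

    def snapshot(self):
        """Return a plain dict, suitable for an API response or a command."""
        with self._lock:
            return {
                op: {"count": count,
                     "avg_latency_ms": self._latency_total_ms[op] / count}
                for op, count in self._counters.items()
            }

    def dump(self, stream):
        """Write the snapshot as one JSON line to the given stream."""
        stream.write(json.dumps(self.snapshot(), sort_keys=True) + "\n")
```

Because the snapshot is tiny compared to per-request logs, it can be written every few seconds without risking log rollover.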

QoS

When multiple applications or users share the same platform, quality of service (QoS) controls prevent noisy-neighbor problems and ensure fair sharing.
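
A token bucket per tenant is one widely used way to enforce this kind of fairness. The sketch below is a minimal single-process version; the class names, rate, and burst parameters are assumptions for illustration, and a real service would also need locking and eviction of idle tenants.

```python
import time


class TokenBucket:
    """Allows bursts up to `burst` requests, refilled at `rate_per_second`."""

    def __init__(self, rate_per_second, burst, clock=time.monotonic):
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock          # injectable so the logic can be tested
        self.updated = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Refill for the elapsed time, never exceeding the bucket capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantLimiter:
    """Gives every tenant its own bucket so one tenant cannot starve others."""

    def __init__(self, rate_per_second, burst, clock=time.monotonic):
        self._args = (rate_per_second, burst, clock)
        self._buckets = {}

    def allow(self, tenant):
        bucket = self._buckets.get(tenant)
        if bucket is None:
            bucket = self._buckets[tenant] = TokenBucket(*self._args)
        return bucket.allow()
```

A request handler would call allow(tenant) before doing any work and reject or queue the request when it returns False.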

Tools

Tools provide various advantages to a service. They can expose some of the service's internal state and provide a way to operate on the collected information. Sometimes they can also serve as a workaround that gives immediate relief for an operational issue.

With all the issues listed above, designing resilient services requires considering and planning for operational aspects at design time.


RocksDB Java Native Memory Growth

RocksDB is developed in C++ and has a Java binding that provides a set of Java classes for accessing RocksDB. In a Java application, memory is managed by the JVM: objects are allocated on the JVM heap, and the garbage collector (GC) manages that heap. RocksDB, however, allocates its C++ objects using the OS malloc library. When RocksDB is used in a Java application, we need to be extra careful about the allocations made by RocksDB.

For a Java process, these are native memory allocations that are not on the heap. If the Java process doesn't delete those objects or close the handles properly after use, they remain allocated and the process's resident set size (RSS) starts growing. Even when the application is well written and handles its allocations properly, depending on the workload it may need to access several SSTables (RocksDB sst files) to serve client requests. This can cause uncontrolled memory growth if RocksDB is not tuned properly, which can exhaust memory and eventually lead to the process being killed by the kernel OOM killer. There is a lengthy discussion in the thread listed in the References section, which provides a lot of detail on how to monitor and resolve this problem if you are experiencing it.
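
On Linux, one simple way to watch for this kind of growth from outside the JVM is to read the process RSS from /proc/<pid>/status and compare it with the configured maximum heap plus an allowance for native memory. The sketch below is a heuristic under those assumptions; the function names and the budget parameters are invented for illustration and are not taken from the referenced thread.

```python
def parse_rss_kb(status_text):
    """Extract VmRSS (in kB) from the text of /proc/<pid>/status on Linux."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:    4194304 kB".
            return int(line.split()[1])
    raise ValueError("VmRSS not found in status text")


def read_rss_kb(pid):
    """Read the current RSS of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_rss_kb(f.read())


def rss_exceeds_budget(rss_kb, max_heap_kb, native_budget_kb):
    """True if RSS is larger than the JVM max heap plus a native allowance.

    Heuristic: RSS well beyond the heap limit suggests off-heap growth.
    """
    return rss_kb > max_heap_kb + native_budget_kb
```

Sampling read_rss_kb for the Java process at a fixed interval and alerting when rss_exceeds_budget returns True gives an early warning before the kernel OOM killer steps in.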



References

https://github.com/facebook/rocksdb/issues/4112