Instrumentation and system events telemetry in cloud-based IoT systems

A few decades ago, in the early days of digital computer programming, programs were written in simple text editors. The text files were handed to a compiler, which emitted an executable, and the executable was then run from a command prompt or terminal. And what if the program had an error that was not caught during compilation? That is where logging and tracing came in. Once your program was running, logs were the only way to figure out the root cause of run-time errors.

Fast forward to the 80s and 90s: we got integrated development environments (IDEs), and later IDEs with GUIs and a whole lot of features. With the advent of IDEs it became much easier to compile, run and debug programs. Logs were still important, but less so, because most errors and issues were caught while debugging in an IDE.

Fast forward again to today, the age of cloud-based solutions, and logs have become more important again. I am not saying that we cannot debug cloud apps in IDEs, but we cannot debug them at scale. When it comes to highly scalable cloud apps, what works for a hundred may not work for a million. And let’s not forget cloud-based IoT applications, which are built to scale and to serve a large number of requests at a time. In these scenarios, the only way to know what went wrong is to check the logs.

Let’s do an estimate for a typical cloud-based IoT solution. Say a million devices are connected to it at peak time and each device sends a hundred messages a day (a reasonable number, I believe). Now assume that each of these messages is processed by the cloud solution, and that every message processing flow generates ten logs on average. That is a billion logs a day. And that’s huge. Really huge.
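To make the arithmetic concrete, here is the same estimate as a few lines of Python. The three inputs are the assumptions stated above; the per-second figure is only a daily average, and real traffic would likely peak well above it.

```python
# Back-of-the-envelope estimate using the assumptions from the paragraph above.
DEVICES = 1_000_000                  # devices connected at peak
MESSAGES_PER_DEVICE_PER_DAY = 100    # messages each device sends per day
LOGS_PER_MESSAGE = 10                # logs written per message processing flow

logs_per_day = DEVICES * MESSAGES_PER_DEVICE_PER_DAY * LOGS_PER_MESSAGE
logs_per_second = logs_per_day / (24 * 60 * 60)  # averaged over a full day

print(f"{logs_per_day:,} logs per day")                      # 1,000,000,000 logs per day
print(f"{logs_per_second:,.0f} logs per second on average")  # 11,574 logs per second on average
```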

So, we see that logging in cloud-based IoT applications is a must-have. In this post, I am not going to talk about any specific cloud platform or logging framework. What I want to focus on is the qualities your logging system should have. The following are the ones I believe are most important.

1. Partitioning: When you have a huge volume of logs, keeping them easy to retrieve and query becomes a real challenge. Partitioning makes retrieval faster. No matter where your logs reside, they should be partitioned by time, so that when you want to look at the logs you know which partition to retrieve. A simple example is log files that roll every day, hour or minute, depending on the load.

2. Correlation: Imagine two messages from the same device being processed at the same time. The logs generated will have the same timestamp and almost the same metadata, so it will be really hard to tell which logs come from which message processing flow. To resolve this, the logs should be correlated: whenever a message processing flow starts, a unique identifier should be generated and attached to all the logs in that flow. The flow may cross process and cloud service boundaries, and the identifier should be passed across them. There are several ways to achieve this: parent thread context, object lifetime, message queuing, etc.

3. Performance: It goes without saying that logging should not impact the performance of the application. One way to achieve that is asynchronous logging. The logs are first written to an in-memory queue and then transferred to the sink on a separate thread, which minimizes the impact on the application flow. There is a chance of losing some logs with this approach (for example, if the process crashes before the queue is drained), but it’s worth it. Example: NLog async target wrappers.

4. Scalability: When the load increases, whether because more devices connect or because devices send more messages than usual, the logging system should not throttle. Your logging system is the only way to figure out issues in the application, so it should be the last thing to fail. Ideally, the logging system should be performance tested for at least twice the expected load. There are several cloud-based telemetry and logging services readily available, and each of them has a throttling limit. Either choose one that can take the load your application can produce, or host a custom solution that can be scaled up and down with the load.
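The first three qualities can be sketched together with nothing but Python's standard `logging` module. This is a minimal illustration, not a recommendation of any framework: the names `build_logger`, `process_message` and `iot.ingest`, the hourly roll and the queue size of 10,000 are all assumptions I made up for the example.

```python
import contextvars
import logging
import logging.handlers
import queue
import uuid

# Correlation: holds the ID of the message processing flow currently running.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


def build_logger(log_path):
    """Return (logger, listener); call listener.stop() at shutdown to drain the queue."""
    # Partitioning: roll over to a new file every hour (assumed interval).
    file_handler = logging.handlers.TimedRotatingFileHandler(
        log_path, when="h", interval=1, utc=True
    )
    file_handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s [%(correlation_id)s] %(message)s")
    )

    # Performance: the application thread only enqueues records; the listener's
    # background thread writes them to the file. If the bounded queue is full,
    # the record is dropped and the error is reported on stderr.
    log_queue = queue.Queue(maxsize=10_000)
    listener = logging.handlers.QueueListener(log_queue, file_handler)
    listener.start()

    logger = logging.getLogger("iot.ingest")
    logger.setLevel(logging.INFO)
    logger.addFilter(CorrelationFilter())  # runs on the calling thread
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    logger.propagate = False
    return logger, listener


def process_message(logger, device_id, payload):
    """Hypothetical message flow: every log line written inside shares one ID."""
    token = correlation_id.set(uuid.uuid4().hex)
    try:
        logger.info("received message from %s", device_id)
        logger.info("processed payload of %d bytes", len(payload))
        return correlation_id.get()
    finally:
        correlation_id.reset(token)
```

Calling `process_message(logger, "device-1", b"...")` twice for the same device produces two pairs of log lines with different IDs in brackets, so the two flows can be told apart even if their timestamps match. This sketch only covers a single process; across process or cloud service boundaries the ID would have to travel with the message itself (for example as a message property) and be set again on the receiving side.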

So, we see that robust logging in cloud-based IoT applications is not an easy task. It requires serious planning and proper implementation. Logs can be a lifesaver in case of severe issues. But if the logs cannot be retrieved in time, or the logging system is not scalable or performant enough, then debugging and fixing a large IoT application that is already in production and not working as expected can be a nightmare.