What is Service Health?
Now that we understand the service, we must ask ourselves what service health really is. For each service capability, we must ask: 1) is the capability working? and 2) is the capability performing well?
What types of monitoring might we consider?
We are all familiar with failure mode monitoring (disk is full, service is not running, memory is exhausted, etc.), so I will not belabour that type of monitoring here.
Next comes capability monitoring. Is each capability healthy? Is it healthy from my local network? Is it healthy from key user locations from inside my network? Is it healthy from the internet? Is it healthy from key user locations around the world via the internet? And so on.
Of course, we are all familiar with capability monitoring as well, something we usually tackle with synthetic transactions. That is a fine approach, but is there a better way, or at least a complementary one? Synthetic transactions are great, but they are only as good as the volume, distribution, and availability of our watcher nodes, and some capabilities, such as taking money out of an automated banking machine, cannot be synthetically monitored for obvious reasons. I will come back to this topic in a moment (see “silver bullets” below).
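To make the watcher-node idea concrete, here is a minimal sketch of a synthetic transaction: a single probe that records whether a capability endpoint responded and how long it took. The URL, timeout, and result shape are all illustrative assumptions, not part of any particular monitoring product.

```python
# Minimal synthetic-transaction sketch: probe a capability endpoint and
# record availability and latency. URL and timeout are illustrative.
import time
import urllib.request

def probe(url: str, timeout_s: float = 5.0) -> dict:
    """Run one synthetic check and return a simple result record."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            ok = 200 <= resp.status < 300
    except Exception:
        ok = False  # any network/HTTP failure counts as an unhealthy check
    latency_ms = (time.monotonic() - start) * 1000
    return {"url": url, "ok": ok, "latency_ms": round(latency_ms, 1)}
```

Running the same probe from many watcher nodes (local network, key user sites, the internet) is what answers the “healthy from where?” questions above.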
Finally, we should consider analytics-based monitoring. For a particular capability, what does normal usage (or another indicator) look like at, for example, noon on a Thursday? Are we within the statistical norm for that point, based on time-series data? Are there patterns in the wealth of service data (performance counters, capacity measures, change request information, normal and major incident information) that machine learning can identify for us?
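As a simple illustration of the “statistical norm” idea, a sketch like the following compares the current reading against the historical mean plus or minus a few standard deviations for the same time-of-week bucket. The history values and the three-sigma band are illustrative assumptions.

```python
# Sketch of analytics-based monitoring: is the current value within the
# statistical norm for this time-of-week bucket (e.g. "Thursday at noon")?
from statistics import mean, stdev

def is_within_norm(history: list, current: float, k: float = 3.0) -> bool:
    """True if `current` lies within k standard deviations of the
    historical mean for this time-of-week bucket."""
    mu, sigma = mean(history), stdev(history)
    return abs(current - mu) <= k * sigma

# Active-user counts seen on previous Thursdays at noon (illustrative data).
thursdays_noon = [980, 1010, 995, 1005, 990, 1002]
print(is_within_norm(thursdays_noon, 998))  # True: in line with history
print(is_within_norm(thursdays_noon, 400))  # False: far outside the norm
```

A real implementation would maintain one such baseline per capability per time bucket; the point is that “normal” is defined by the data, not by a fixed threshold.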
All three types of monitoring are very important, and it is our job to work with our application and engineering team customers to drive the right decisions now and in the future. For every capability, what is the cleverest way to know if the capability is working? For every failure mode, how do we alert while limiting both false positives (noise) and false negatives (monitoring misses)? How do we correlate alerts of each type to further eliminate noise? These are the conversations we need to be having, and they will provide immense business value to our development and engineering customers. Notice how we have changed the conversation from one of monitoring tools, monitoring agents, and monitoring scripts to a conversation about business value? That is our goal.
Are there any key points to help the developers be more successful?
Inevitably, their modern applications are instrumented already. They capture login failures in their logs. They capture database read retries and timeouts in their logs. They capture the information they care about...in their logs. That is why they immediately open up logs when they join major incident bridges. We want that information! We want that information in a way that easily wires into our Service Monitoring Service, and we want it in a way that does not require us to make changes on our side when they iteratively improve on their side (e.g. if they change a string in a log file, we would otherwise have to update our monitoring script accordingly). There are four parts to instrumentation-based monitoring:
- Capture the data
- Set criteria to alert on data patterns
- Write the event
- Act on the event (alert, ticket, page, etc.)
Developers already capture the data (#1), and we have become great at actioning the events in our Service Monitoring Service (#4). Historically, we have tried to define the criteria (#2) for the alerts in operations, but we have established in this series that dev is better positioned to do that step because they know their application better than we do. Dev should own the criteria (#2) as part of the code and, if they own the criteria, they should also write the events in a way that is simple for us to consume (#3).
Let’s take an example. If we want to alert on active users dropping by X% in Y minutes, dev needs to log the data. Dev should own the criteria for X and Y, and they should be able to change them in the code without a change request to the monitoring team. So, in this case, perhaps we agree with dev that the “Active User Critical Drop” rule will be event ID 5555; then, in the code, they handle the rest. We will pick up any event ID 5555 and take the appropriate, pre-agreed action for that scenario. If dev started with a 50% drop in 5 minutes and later want to change it to a 60% drop in 6 minutes, they can do so without a change request to the monitoring team. Of course, they would still take that change through Change Management, but the point is that they do not generate work or delay by having to create a work item for the monitoring team.
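In code, the division of ownership in this example might look like the following sketch. The event ID 5555 comes from the example above; `emit_event` is a hypothetical stand-in for whatever event writer the platform provides, and the specific thresholds are dev-owned values they can change at will.

```python
# Hedged sketch of the "Active User Critical Drop" rule: dev owns the
# criteria (X% drop in Y minutes) in code and emits the agreed event ID;
# the monitoring service only needs to react to that ID.
ACTIVE_USER_DROP_EVENT_ID = 5555
DROP_PERCENT = 50    # X: dev can change this without a monitoring work item
WINDOW_MINUTES = 5   # Y: likewise owned by dev

def emit_event(event_id: int, message: str) -> None:
    # Placeholder: in production this would write to the event log/syslog.
    print(f"EVENT {event_id}: {message}")

def check_active_users(count_now: int, count_window_ago: int) -> None:
    """Emit event 5555 when active users drop by DROP_PERCENT in the window."""
    if count_window_ago == 0:
        return
    drop_pct = (count_window_ago - count_now) / count_window_ago * 100
    if drop_pct >= DROP_PERCENT:
        emit_event(ACTIVE_USER_DROP_EVENT_ID,
                   f"Active users fell {drop_pct:.0f}% in {WINDOW_MINUTES} minutes")

check_active_users(count_now=200, count_window_ago=1000)  # 80% drop -> event 5555
```

Changing X from 50 to 60 is a one-line code change on the dev side; nothing changes in the monitoring service, which keeps reacting to event ID 5555 as before.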
Are there any silver bullets?
We have all been working in IT long enough to know that there is no such thing as a silver bullet. Everyone seems to look for tools that will solve all of their monitoring needs. The problem is that tools do not know the business. The value is in knowing the business, quantifying the business needs and their priorities, and then applying the tools (and other clever solutions) to meet those needs.
Even though there are no silver bullets, there is an untapped opportunity. Historically, we have focused on thresholds: if the queue gets above 300 items, then alert. But at peak times (e.g. Monday morning when everyone signs in), the queue may be at 400 while the capability to ‘buy widgets’ is both available and performing efficiently. And at non-peak times (e.g. the weekend when nobody is working), we may be down for many hours before we cross the threshold. That is a poor monitoring rule. As an industry, we have tried to solve that problem with rolling-average targets and the like. The real answer is to understand what the thresholds should be at each point in a time series (e.g. what is the normal volume on a Tuesday at 1pm?), but that approach requires considerable investment, and we need a solution right now. Sudden change is that untapped opportunity. We do not really care that the queue is at 400; what we care about is that the queue went from 150 to 400 in a matter of minutes. That is a sudden-change alert, and it provides much more business value.
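A sudden-change rule can be sketched in a few lines: instead of comparing the latest reading to an absolute threshold, compare it to a reading from a few minutes earlier. The window length and the change factor below are illustrative assumptions, not recommended values.

```python
# Sketch of a sudden-change alert: fire on the rate of change, not on the
# absolute level. Window and factor are illustrative.
def sudden_change(samples: list, window: int = 3, factor: float = 2.0) -> bool:
    """True if the latest sample is at least `factor` times the sample
    taken `window` readings ago - a sudden jump, not a high level."""
    if len(samples) <= window or samples[-1 - window] == 0:
        return False
    return samples[-1] / samples[-1 - window] >= factor

queue_depth = [140, 150, 155, 160, 400]  # one reading per minute (illustrative)
print(sudden_change(queue_depth))        # True: 400 vs 150 a few minutes ago
```

Note that the same rule stays quiet on a Monday-morning queue that sits steadily at 400, and fires on a weekend queue that jumps from 150 to 400, which is exactly the behavior the fixed threshold gets wrong.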
More blog posts in the Building Service Monitoring as a Service with an Eye on the Cloud series
Read the first blog post from Carroll Moon, Service Monitoring as a Strategic Opportunity.
Read the second post, The Future of Service Management in the Era of the Cloud.
Read the third post, One Team - One Set of Service Management Objectives.
Read the fourth post, Service Monitoring Service Outputs.
Read the fifth post, Service Management Outputs.
Read the sixth post, Building Trust in the Service Monitoring Service.
Read the seventh post, Making the Service Viral.
Read the eighth post, Service Monitoring Application Development.
Read the tenth post, Delivering the Service Monitoring Service.
Read the final post, The service monitoring service – rounding it all up.