Looking back, the Building Service Monitoring as a Service with an Eye on the Cloud blog series has covered a lot of ground - from why, to what, to how. Now it is time to look ahead to how we pump water through the pipes we have built. It is time to discuss how we become heroes to our customers - our application development and engineering teams. To become heroes to those constituents, we must understand what makes them tick.
Obvious point here - application development teams are made up of... developers. They create awesome apps that help our businesses capture new market share and/or reduce costs. They don’t get excited by the mundane but they do get excited by solving problems and by building what’s next. So it’s our job to help them focus more of their time on what they love to do and less of their time on what they don’t like to do. But when we do that, we have to do it in a way that makes sense to them in the modern world (i.e. we will never excite them by throwing humans at manual tasks).
It’s also important to mention that there are two approaches - the carrot and the stick. With developers, the stick is not a sustainable method; we cannot use policy to force them to be our partners, because automation will always win. We must use the “carrot” approach and provide real value to them; providing real value is the way to a true, sustainable partnership.
With respect to monitoring, we can help them in many ways including the following:
- Helping them think through the needs and solutions for their app or service. What do they need to monitor? What are the clever options to monitor?
- Helping them avoid writing code that is not adding value (i.e. writing “page me” scripts does not help them capture more market share for our businesses)
- Guiding them to instrument their code in a way that makes the best use of our Service Monitoring Service
- Enabling them to learn from their (and our) experiences so they improve every day
- Acting as a knowledge and best-practice broker amongst application teams because, if we have 10 app teams, each would need nine connections to its peers in a mesh model. In a hub/spoke model, each app team has one connection back to us and trusts us to share best practices and connect the dots at the right time.
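The mesh vs. hub/spoke arithmetic above can be made concrete with a minimal sketch. The function names here are purely illustrative, not part of any monitoring tool: with n teams, a full mesh needs n(n-1)/2 peer connections in total (nine per team when n is 10), while a hub/spoke model needs only n.

```python
def mesh_connections(n: int) -> int:
    """Total connections if every app team talks to every other team."""
    return n * (n - 1) // 2

def hub_spoke_connections(n: int) -> int:
    """Total connections if every app team connects only to the central hub."""
    return n

teams = 10
print(f"Full mesh: {mesh_connections(teams)} connections")      # 45 total (9 per team)
print(f"Hub/spoke: {hub_spoke_connections(teams)} connections")  # 10 total (1 per team)
```

The gap widens quadratically as teams are added, which is why the broker role pays off more the larger the organization gets.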
Helping them think through the needs and solutions for their app or service
What is the goal of monitoring?
We discussed this topic early in the series - the goal of monitoring is the detection aspect of the incident lifecycle. If we do not know that we have impact, we cannot fix it quickly with humans or automation, and MTTR (mean time to repair) will suffer. And, if we do not know that we have impact, we cannot communicate quickly, which means MTTC (mean time to communicate) will suffer.
Often, we (and developers) get distracted when it comes to monitoring because many of the tools in the industry do many things. Monitoring is about detection. There are other needs - availability reporting, capacity planning, security auditing and so many more - and people lose track of the critical “detection” piece in the noise of those other, orthogonal pieces. Of course we cannot lose sight of those other critical needs but, in the same respect, we cannot lose track of detection.
Where should we start?
As an industry, we have focused mainly on failure-mode alerting; most of the tools aim at that approach. We make ourselves feel better by having lots of rules, but the reality is that we cannot predict every failure mode and, consequently, we cannot have rules for every failure mode. So is it better to have 10 well-thought-out failure-mode rules or 100 chatty rules? The result of too many alerts is just as bad as the result of no alerts, because noise undermines the primary goal of monitoring (detection). Too often, we initiate outage bridges due to help desk calls and then open the monitoring console in hopes of finding an alert for the issue. It’s better to start with no rules and selectively add them than the alternative, which is to start with lots of rules and selectively tune them down. Unfortunately, the latter is what we have always done. It’s time to change the approach.
Even more importantly though, we need to take a step back and start with the service definition. What is the service? What are the server roles for the service (frontends, backends, middle tiers, etc.)? What are the service capabilities that the users consume (e.g. what will the end user tell us is broken when they call the Service Desk?)? Once we understand the service, we can start to have meaningful conversations about monitoring. And, as an aside, we can provide a lot of value by helping the developers and engineers quantify their service definition in this way.
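To make "quantifying the service definition" concrete, here is a minimal sketch of what such a definition might look like as structured data. Every name here (the service, the roles, the capabilities) is a hypothetical example, not a schema prescribed by the Service Monitoring Service.

```python
# Hypothetical service definition: roles (frontends, backends, middle tiers)
# plus capabilities phrased the way an end user would report them broken
# when calling the Service Desk.
service_definition = {
    "service": "Order Processing",
    "roles": {
        "frontend": ["web01", "web02"],     # what the user touches
        "middle_tier": ["api01", "api02"],  # business logic
        "backend": ["db01"],                # data store
    },
    "capabilities": [
        "Place a new order",
        "Check order status",
        "Cancel an order",
    ],
}

def summarize(defn: dict) -> str:
    """One-line summary of a service definition for review with the app team."""
    roles = ", ".join(defn["roles"])
    return (f"{defn['service']}: roles [{roles}], "
            f"{len(defn['capabilities'])} user-facing capabilities")

print(summarize(service_definition))
```

Writing the definition down in a form like this - rather than leaving it in people's heads - is what turns the monitoring conversation from guesswork into something both teams can reason about.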
More blog posts in the Building Service Monitoring as a Service with an Eye on the Cloud series
Read the first blog post from Carroll Moon, Service Monitoring as a Strategic Opportunity.
Read the second post, The Future of Service Management in the Era of the Cloud.
Read the third post, One Team - One Set of Service Management Objectives.
Read the fourth post, Service Monitoring Service Outputs.
Read the fifth post, Service Management Outputs.
Read the sixth post, Building Trust in the Service Monitoring Service.
Read the seventh post, Making the Service Viral.
Read the ninth post, Monitoring Service Health.
Read the tenth post, Delivering the Service Monitoring Service.
Read the final post, The service monitoring service – rounding it all up.