Throughout the "Building Service Monitoring as a Service with an Eye on the Cloud" blog series, we have focused on the need for a Service Monitoring Service - a “monitoring hub” of sorts. This approach is different from the legacy centralized monitoring approach whereby the central team owns both the platform and the rule logic. This approach is also different from the completely decentralized approach of every application team doing their own thing and duplicating effort.
If we deliver the Service Monitoring Service (hub) correctly, we will help the development and engineering teams avoid writing code that does not provide business value. Consider a development team that is responsible for writing code to sell more widgets. Distracting that team by having them write code to page themselves when their service has an issue works against the business goals of capturing market share and reducing the cost of doing business. The purpose of the Service Monitoring Service is to let the application development teams focus on their charters.
Guiding application teams to instrument their code in a way that makes the best use of our Service Monitoring Service
We have already discussed that the development and engineering teams are instrumenting their code today. However, that instrumentation usually a) is not easy for our Service Monitoring Service to consume, b) does not allow the logic for alerting to remain with the application instead of in centralized monitoring rules, and c) is not thought of in terms of ‘monitoring’: the developers are logging information for root cause analysis, trending, etc.
Once we define our Service Monitoring Service (service definition, quality targets, inputs, outputs, etc.), we can move on to the more strategic work of consulting with the development and engineering teams on how best to integrate with it. Earlier in this series we discussed that it is important for ops to have development expertise. This point takes that requirement to another level. Not only do we need to write code for our Service Monitoring Service, we also need to be able to guide our customer application development and engineering teams in how to write and update their code for monitor-ability.
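To make the three properties above concrete, here is a minimal sketch of what "instrumented for monitoring" could look like. Everything in it is an assumption for illustration: the field names, the `emit_health_event` helper, the `widget-store` service, and the 2000 ms threshold are hypothetical, and the real schema would be whatever the hub and the application teams agree on. The point is only that the application emits one machine-readable event, and that the decision about what counts as unhealthy and page-worthy stays in the application's own code.

```python
import json
import sys
from datetime import datetime, timezone


def emit_health_event(service, capability, healthy, detail, page_owner, stream=sys.stdout):
    """Write one structured health event as a single JSON line.

    All field names are hypothetical; the real schema is whatever the
    monitoring hub and the application teams agree on.
    """
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "service": service,
        "capability": capability,
        # The application decides what "unhealthy" means, so the alerting
        # logic stays with the team that owns the code.
        "healthy": healthy,
        # The application also decides whether this condition is page-worthy.
        "page_owner": page_owner,
        # Free text kept for root cause analysis and trending.
        "detail": detail,
    }
    stream.write(json.dumps(event) + "\n")
    return event


def check_checkout_latency(latency_ms, threshold_ms=2000):
    """Hypothetical check owned by the team that sells the widgets."""
    healthy = latency_ms <= threshold_ms
    return emit_health_event(
        service="widget-store",
        capability="checkout",
        healthy=healthy,
        detail=f"checkout latency {latency_ms} ms (threshold {threshold_ms} ms)",
        page_owner=not healthy,
    )
```

With a shape like this, the hub only needs to read JSON lines and route on `page_owner`; it does not need a central rule that knows what a healthy checkout looks like.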
Enabling them to learn from their (and our) experiences so they improve every day
As operations people, one thing that we take pride in is our experience. Oscar Wilde said “experience is simply the name we give our mistakes”. We are a passionate group because we have learned valuable lessons the hard way, and our experience can be very valuable to development and engineering teams IF we communicate in the right ways. We all know the value of problem management but we do not need to try to impress the development teams by rattling off terms like error control and major problem reviews. Most developers will ignore our rants but will listen to us if we talk about scenarios. We should have discussions and we should ask questions. With respect to monitoring, we might ask the following simple questions:
- For incidents for your service over the past six months, how quickly was each caught by monitoring?
- For those incidents, which service capabilities had the slowest time-to-detect?
- Do you believe it would be worthwhile to review each incident to capture a few key data points so that we can trend that information in the future? That way, you can prioritize your instrumentation investments.
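If the team says yes, the data points needed to answer those questions are small. The sketch below assumes each reviewed incident records the affected capability, when the problem started, and when monitoring caught it. The record layout, the sample incidents, and the function names are hypothetical and for illustration only.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

# Hypothetical incident records captured during incident reviews.
INCIDENTS = [
    {"capability": "checkout", "started": "2024-01-10T09:00", "detected": "2024-01-10T09:52"},
    {"capability": "checkout", "started": "2024-02-03T14:00", "detected": "2024-02-03T14:08"},
    {"capability": "search", "started": "2024-02-20T11:30", "detected": "2024-02-20T11:33"},
]


def minutes_to_detect(incident):
    """Minutes between the start of the problem and its detection."""
    started = datetime.fromisoformat(incident["started"])
    detected = datetime.fromisoformat(incident["detected"])
    return (detected - started).total_seconds() / 60


def slowest_capabilities(incidents):
    """Return (capability, median minutes to detect), slowest first."""
    by_capability = defaultdict(list)
    for incident in incidents:
        by_capability[incident["capability"]].append(minutes_to_detect(incident))
    summary = {cap: median(values) for cap, values in by_capability.items()}
    return sorted(summary.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for capability, minutes in slowest_capabilities(INCIDENTS):
        print(f"{capability}: median time-to-detect {minutes:.0f} min")
```

The capability at the top of that list is a reasonable first candidate for the next instrumentation investment.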
Bingo! We’ve just had a discussion about the value of major problem reviews, trending and targeting preventative action without making the developers run away. If we continue the conversation, it’s likely we can drive forward to a world where we provide additional value to the development team by helping them review their incidents and enable the capture of the data. We do likely have problem management tools, don’t we? We can provide further value by driving the review discussions with the development teams. We can provide more value by helping them quantify, prioritize, and track the actions from the reviews and we can provide even greater value by putting our consulting hats on and helping the development team think through clever ways to accomplish resolution for each action item. And...
Acting as a knowledge and best practice broker amongst application teams
...given that we sit at the centre of all of these application and engineering teams, we can provide even greater value by bringing the best practices together. If we have 10 application teams, each would need to have nine connections to adequately share best practices. That is unrealistic and it is a waste of resources. What if we do a great job of providing the hub Service Monitoring Service? What if we are providing the deeper consultative value that we just discussed to every team? Won’t we recognize patterns and best practices that we can share that will provide even more value? For example, “Application Team 2, you just had an incident where your time-to-detect was 52 minutes. Application Team 1 had a similar issue three months ago and we quickly plugged the gap by doing ABC. Let me set up a quick 30-minute meeting with everyone to help you move faster.” THAT IS PRICELESS VALUE, and that is our goal.
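The arithmetic behind the "nine connections" point is worth spelling out. In this small sketch the function names are made up for illustration, and it assumes every pair of teams would need its own relationship in the fully decentralized case.

```python
def mesh_links(teams):
    """Distinct team-to-team relationships when every team talks to every other."""
    return teams * (teams - 1) // 2


def hub_links(teams):
    """Relationships when every team talks only to the monitoring hub."""
    return teams


if __name__ == "__main__":
    teams = 10
    print(f"per-team connections without a hub: {teams - 1}")
    print(f"total relationships without a hub: {mesh_links(teams)}")
    print(f"total relationships with a hub: {hub_links(teams)}")
```

For 10 teams that is 45 relationships to maintain without a hub versus 10 with one, and the gap widens quickly as teams are added.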
From a monitoring perspective, we should do all of these things to help our application and engineering teams - our customers. We should start small, and we should build the partnerships by delivering real value to our customer teams.
We have spent a lot of time together throughout this series. We have one more hill to climb, and that is to answer the following question: “Can perfect monitoring still end in failure?” Happy thinking!
More blog posts in the Building Service Monitoring as a Service with an Eye on the Cloud series
Read the first blog post from Carroll Moon, Service Monitoring as a Strategic Opportunity.
Read the second post, The Future of Service Management in the Era of the Cloud.
Read the third post, One Team - One Set of Service Management Objectives.
Read the fourth post, Service Monitoring Service Outputs.
Read the fifth post, Service Management Outputs.
Read the sixth post, Building Trust in the Service Monitoring Service.
Read the seventh post, Making the Service Viral.
Read the eighth post, Service Monitoring Application Development.
Read the ninth post, Monitoring Service Health.
Read the final post, The service monitoring service – rounding it all up.