The first step in Service Design is to define the functional requirements and capabilities. No matter what we are doing and no matter the industry, we should always start with the customer in mind. In the case of our Service Monitoring Service, our customers are the application and service teams that we want to consume it. If we want those teams to consume our Service Monitoring Service rather than establish their own monitoring platforms, we will need to deliver reliable and reusable commodity outputs, including the following:
- Console User Interface (UI)
- The console should be simple and responsive. It should allow for easy filtering by team, service, server role, capability, datacenter, and other configurable business dimensions.
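As a minimal sketch of that kind of filtering, assume each alert carries a dictionary of business dimensions. The `Alert` class, the `filter_alerts` helper, and the dimension names below are hypothetical illustrations, not part of any existing product:

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    """Hypothetical alert record; dimensions hold configurable business tags."""
    title: str
    dimensions: dict = field(default_factory=dict)


def filter_alerts(alerts, **criteria):
    """Return the alerts whose dimensions match every requested key/value."""
    return [
        alert for alert in alerts
        if all(alert.dimensions.get(key) == value for key, value in criteria.items())
    ]


# Sample data, invented for illustration only.
alerts = [
    Alert("Disk full", {"team": "payments", "datacenter": "east", "server_role": "db"}),
    Alert("High latency", {"team": "payments", "datacenter": "west", "server_role": "web"}),
    Alert("Cert expiring", {"team": "search", "datacenter": "east", "server_role": "web"}),
]

east_payments = filter_alerts(alerts, team="payments", datacenter="east")
print([alert.title for alert in east_payments])
```

Because the dimensions are plain key/value pairs, adding a new business dimension would not require a change to the filter itself.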
- Automated Ticket Creation
- Ticket creation should work with whatever ticketing tool the application/service team is using. Of course, as discussed above, the utopian world would have all teams using a single tool, but in reality that is often not the case. We should keep pushing toward utopia, but we should not forgo improvements while we wait for it.
- One of the reasons that we all want a single incident tool is to ensure that we have great data. The great news is that if all application/service teams use the central Service Monitoring Service, incident metadata will be centralized in our service. Someone might say “but you are building a Monitoring Service and not an Incident Service”. I would advise that person to reread the introductory sections of this blog.
- Another reason that we desire a single incident tool is to simplify escalations between different teams. If we make our service viral amongst our application teams, they will use the service for all of their escalations. At that point, our Service Monitoring Service will be wired to all of the application/service teams. The natural progression is to have all cross-team escalations flow back through the central hub Service Monitoring Service. Effectively, the Service Monitoring Service becomes the switchboard for all monitoring-generated Incidents, which moves us closer to that utopian world.
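One way to picture ticket creation is as a thin routing layer: each team keeps its own tool, while the incident metadata lands in one central log. The sketch below is only an illustration under that assumption. `TicketRouter`, `make_fake_adapter`, and the tool and team names are all hypothetical, and the adapters are stand-ins for real ticketing integrations:

```python
import itertools


def make_fake_adapter(prefix):
    """Stand-in for a real ticketing integration; returns sequential IDs."""
    counter = itertools.count(1)
    return lambda title, severity: f"{prefix}-{next(counter)}"


class TicketRouter:
    """Routes tickets to each team's own tool and keeps central metadata."""

    def __init__(self):
        self._adapters = {}     # tool name -> callable that files a ticket
        self._team_tools = {}   # team name -> tool name
        self.incident_log = []  # centralized incident metadata

    def register_tool(self, tool, create_ticket):
        self._adapters[tool] = create_ticket

    def assign_team(self, team, tool):
        if tool not in self._adapters:
            raise KeyError(f"unknown ticketing tool: {tool}")
        self._team_tools[team] = tool

    def open_ticket(self, team, title, severity):
        tool = self._team_tools[team]
        ticket_id = self._adapters[tool](title, severity)
        record = {"team": team, "tool": tool, "ticket_id": ticket_id,
                  "title": title, "severity": severity}
        self.incident_log.append(record)  # the data stays central
        return record


router = TicketRouter()
router.register_tool("tool_a", make_fake_adapter("A"))
router.register_tool("tool_b", make_fake_adapter("B"))
router.assign_team("payments", "tool_a")
router.assign_team("search", "tool_b")

router.open_ticket("payments", "Disk full", "high")
router.open_ticket("search", "Index lag", "medium")
print(len(router.incident_log))
```

The same log could later back cross-team escalations, since every ticket already passes through the router regardless of which tool filed it.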
- Notification via Automated Email
- Experience shows us that many individuals and teams love email notifications. We will not likely change their minds any time soon. They want emails. Many teams have built business processes around email. We can fight against reality, or we can enable our customers. The Service Monitoring Service must support configurable email-based notifications if we expect adoption of our Service Monitoring Service.
- NOTE: be cautious of recursive dependencies! We should not send email notifications via an email service that we are monitoring with our service. If that email service goes down, we would raise an alert, but the notification might never reach its recipients because of the very outage we are alerting about. To minimize that risk, we should send outbound notifications through a decoupled SMTP email service that is independent of the services we monitor.
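The recursive-dependency concern can be made explicit in configuration. As a rough sketch, assume each notification channel declares the services it depends on, so that we can skip any channel that relies on the service currently failing. The `usable_channels` helper and the channel and service names are hypothetical:

```python
def usable_channels(failing_service, channels):
    """Keep only channels that do not depend on the failing service.

    `channels` maps a channel name to the set of services it relies on.
    """
    return [
        name for name, depends_on in channels.items()
        if failing_service not in depends_on
    ]


# Illustrative dependency map; names are invented for this example.
channels = {
    "corporate_email": {"corporate_mail"},
    "standalone_smtp": {"standalone_relay"},
}

# An alert about the corporate mail system must not travel over it.
print(usable_channels("corporate_mail", channels))
```

A check like this does not replace a decoupled SMTP service, but it would surface the moment a channel quietly picks up a dependency on something we monitor.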
- Notification via Short Message Service (SMS)
- As with email, many on-call processes depend upon text messages, so we must enable the capability.
- The entry-level investment is email-to-SMS. That approach is less complex and less costly than true SMS, but it does introduce some risk: if the SMTP service that we use for outbound email notifications is down, both our email notifications and our SMS notifications will fail.
- The higher-order option is to enable true SMS notifications. We would likely leverage a third-party SMS service for that capability so that the third party would be the one dealing with the country-specific regulatory stipulations of SMS.
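To illustrate the email-to-SMS entry point, the sketch below builds an ordinary email addressed to a gateway that forwards it as a text message. The gateway domain, the sender address, and the `build_sms_email` helper are assumptions made up for this example; real gateway addresses depend on the carrier or provider. The 160-character cap is likewise a conservative assumption based on the size of a single SMS segment:

```python
from email.message import EmailMessage

# Hypothetical values; a real gateway domain depends on the carrier/provider.
SMS_GATEWAY_DOMAIN = "sms-gateway.example.com"
SMS_BODY_LIMIT = 160  # assumed cap, based on a single SMS segment


def build_sms_email(phone_number, alert_title, sender="alerts@example.com"):
    """Build an email that an email-to-SMS gateway would deliver as a text."""
    digits = "".join(ch for ch in phone_number if ch.isdigit())
    if not digits:
        raise ValueError("phone number contains no digits")
    message = EmailMessage()
    message["From"] = sender
    message["To"] = f"{digits}@{SMS_GATEWAY_DOMAIN}"
    message.set_content(alert_title[:SMS_BODY_LIMIT])
    return message


sms = build_sms_email("+1 (555) 010-0199", "Disk full on db01")
print(sms["To"])
```

The resulting message would go out through the same SMTP path as any other notification (for example via `smtplib.SMTP(...).send_message(sms)`), which is exactly why an SMTP outage takes down both email and email-to-SMS at once.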
- Notification via Automated Phone call
- Application teams that build their own monitoring always have a UI, they always create tickets, and they can send emails. So the outputs so far in this list (UI, tickets, email, and SMS) are commodity outputs. We need to provide differentiating business value for the application/service teams by delivering desired outputs that would be cost-prohibitive for a single application team to deliver for itself.
- Automated phone call notifications will immediately gain interest from the application teams. Imagine the value of an on-call engineer getting an automated phone call for a critical alert, where the robot reads the alert title and allows the engineer to press 1 to acknowledge the alert or press 2 to pass the alert to a backup.
- There are many third-party services that can make these phone calls for our service. We do not need to build the phone call service. We simply need to choose a third-party automated phone service and wire our Service Monitoring Service to it via APIs. We should maintain the business logic in our Service Monitoring Service and simply pass instructions, along with the text that we want the robot to recite, to the phone call service.
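To make that split of responsibilities concrete, here is a small sketch in which our service builds the instructions for the call and interprets the key press, while the phone provider only places the call and recites the text. The payload fields, both function names, and the alert dictionary layout are hypothetical; no real provider's API is shown:

```python
def build_call_request(alert_id, alert_title, phone_number):
    """Build the instructions we would hand to a third-party call service.

    The field names are illustrative, not any real provider's schema.
    """
    return {
        "to": phone_number,
        "say": (f"Critical alert: {alert_title}. "
                "Press 1 to acknowledge. Press 2 to pass the alert to your backup."),
        "callback_ref": alert_id,
    }


def handle_keypress(alert, digit):
    """Apply our own business logic to the digit reported by the provider."""
    if digit == "1":
        alert["status"] = "acknowledged"
    elif digit == "2":
        alert["status"] = "escalated"
        alert["assignee"] = alert["backup"]
    # Any other input leaves the alert unchanged so it can be retried.
    return alert


alert = {"id": "alert-42", "status": "open", "assignee": "alice", "backup": "bob"}
request = build_call_request(alert["id"], "Disk full on db01", "+15550100199")
print(request["say"])
print(handle_keypress(alert, "2"))
```

Keeping `handle_keypress` on our side means we could swap phone providers later without touching the acknowledgment and escalation rules.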
- Automated Crisis Bridge Establishment
- As with automated phone calls, automated crisis bridge establishment is another highly desired value-add that will provide great business value while giving the application teams an incentive to consume our Service Monitoring Service.
- Almost inevitably, we already have a list of conference bridge numbers in place today that we use for Major Incidents.
- Our Service Monitoring Service simply needs to uniquely assign one of our existing conference bridge IDs to each alert that meets the bridge criteria.
- From there, we would create a high-priority Incident record for the appropriate application team in their Incident Management tool. We would send notifications via email, SMS, and/or phone call. All of those notifications would include the conference bridge details.
- We can even configure the phone call automation to allow the on-call engineer to join the conference bridge with the push of a button!
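The bridge assignment step above can be sketched as a small pool of existing bridge IDs, where each qualifying alert gets exactly one bridge and no bridge is shared by two open alerts. The `BridgePool` class, the bridge IDs, and the "critical severity" criterion are assumptions for illustration; the real bridge criteria would be whatever we define for Major Incidents:

```python
class BridgePool:
    """Hands out existing conference bridge IDs, one per qualifying alert."""

    def __init__(self, bridge_ids):
        self._free = list(bridge_ids)
        self._assigned = {}  # alert id -> bridge id

    def assign(self, alert_id, severity):
        # Hypothetical bridge criterion: only critical alerts get a bridge.
        if severity != "critical":
            return None
        if alert_id in self._assigned:
            return self._assigned[alert_id]  # same alert keeps its bridge
        if not self._free:
            raise RuntimeError("no free conference bridges")
        bridge = self._free.pop(0)
        self._assigned[alert_id] = bridge
        return bridge

    def release(self, alert_id):
        """Return the bridge to the pool once the incident is resolved."""
        bridge = self._assigned.pop(alert_id, None)
        if bridge is not None:
            self._free.append(bridge)


pool = BridgePool(["bridge-1", "bridge-2"])
print(pool.assign("alert-1", "critical"))
print(pool.assign("alert-2", "warning"))
```

The assigned bridge ID would then be included in the Incident record and in every email, SMS, and phone notification for that alert.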
More blog posts in the Building Service Monitoring as a Service with an Eye on the Cloud series
Read the first blog post from Carroll Moon, Service Monitoring as a Strategic Opportunity.
Read the second post, The Future of Service Management in the Era of the Cloud.
Read the third post, One Team - One Set of Service Management Objectives.
Read the fifth post, Service Monitoring Service.
Read the sixth post, Building Trust in the Service Monitoring Service.
Read the seventh post, Making the Service Monitoring Service Viral.
Read the eighth post, Service Monitoring Application Development.
Read the ninth post, Monitoring Service Health.
Read the tenth post, Delivering the Service Monitoring Service.
Read the final post, The service monitoring service – rounding it all up.