ITIL 4 and automation: Machine learning for user support White Paper

  • White Paper
  • Incident management
  • Service desk
  • Service management
  • ITIL

Authors: Evgeny Shilov and Roman Jouravlev

February 20, 2019

Advanced analytics, including big data, machine learning, natural language processing and other technologies commonly known as artificial intelligence, have been developing for years now, promising increasing benefits for various industries and businesses. However, the adoption of these same methods to support IT management has been slower than one might expect.

In August 2017, Andrew Lerner, a research vice president at Gartner, wrote in his blog:

‘We were hearing from many IT ops leaders building incredibly sophisticated Big Data and Advanced Analytics systems for business stakeholders, but were themselves using rudimentary, reactive red/yellow/green lights and manual steps to help run the infrastructure required to keep those same systems up and running. Further, we’re all now familiar in our personal lives with dynamic recommendations from online retailers, search providers, virtual personal assistants, and entertainment services. Talk about a paradox!

‘Now I wouldn’t say everything has changed since that time, but a lot has, and for the better. Since then, we have seen a number of exciting developments including a rapid shift in the acceptance of and interest in applying a broad spectrum of Artificial Intelligence (AI) capabilities to enterprise IT operations management challenges.

‘As a result, we introduced the concept of AIOps (originally called Algorithmic IT Operations, now Artificial Intelligence for IT Operations) as a means to describe growing interest and investment in these technologies.’1

Since then, many vendors have started introducing solutions for Artificial Intelligence for IT Operations (AIOPS), and this has become a common trend in IT management automation. AIOPS is now included in the scope of market analysis performed by leading industry analysts, and it is becoming a hot topic on the agenda of many IT leaders.

This is how Gartner defines AIOPS:

‘AIOps platforms utilize big data, modern machine learning and other advanced analytics technologies to directly and indirectly enhance IT operations (monitoring, automation and service desk) functions with proactive, personal and dynamic insight. AIOps platforms enable the concurrent use of multiple data sources, data collection methods, analytical (real-time and deep) technologies, and presentation technologies.’2

As with many other solutions, AIOPS is often perceived as a magical silver bullet that would remove the need for human intervention in IT operations and support, and make the right things happen automagically. Moreover, as with many other silver bullets, it is expected to be available out-of-the-box, and the box is expected to be shiny and expensive.

In practice, this is not the case. If organizations follow the ITIL guiding principles, such as ‘start where you are’, ‘progress iteratively with feedback’ and, most of all, ‘focus on value’, they are unlikely to jump to a different IT management tool every time a vendor offers a shiny new piece of functionality, however well it is marketed. In fact, most organizations are not even able to make the jump from one tool to another. ITSM automation is often a complicated landscape, and migrations are complex and expensive projects themselves. This does not mean, however, that the opportunities provided by emerging technologies should be ignored.

This paper describes a small step forward in utilizing machine learning, a form of automation, for user support. The paper outlines a real case study, which may be useful for IT support practitioners considering their first steps in AIOPS, without a huge budget or revolutionary changes in the ITSM automation architecture. The solution described in the paper has been developed by the Cleverics team for a client and continues to evolve.

1 https://blogs.gartner.com/andrew-lerner/2017/08/09/aiops-platforms/

2 https://www.gartner.com/document/3263717

User support: challenges of automation

The case study presented here will focus only on using machine learning for user support. In this case, user support refers to the processing of user tickets (in the form of calls, emails, portal records, etc.), initial categorization, routing and resolution. These activities may fall within the scope of one or several of the ITIL practices, including, but not limited to:

  • service desk
  • incident management
  • service request management
  • change control
  • information security management.

Initial contact and data collection have always been vital steps for both users and service providers, as these activities define how good the user experience will be, and provide specialists with the necessary information about the support case. Historically, the main channel for initial contact was the phone call. In their conversations with service desk agents, users were led through:

  • initial data collection
  • classification
  • initial support
  • supplementary data collection
  • routing to a specialist group.

Call-based support can be effective, and it brings many benefits, but it may also become expensive and inefficient, especially when processing a large number of simple standard cases. Attempts have been made for years to direct some (or all) support tickets to other channels, such as email or a user portal. These attempts have met with various levels of success, depending on many factors, such as the users’ qualifications, variability of requests, quality of processing algorithms, etc.

Although it is not easy to automatically process email text for classification and data collection, it is possible to automatically route emails to specialist groups. The most common mechanisms are routing based on the email address and subject line used. So for example, all emails addressed to SAP-HR@company.com, or sent to another email address but referencing SAP-HR in the subject line, would be automatically routed to the SAP HR support group. Obviously, neither of these approaches would filter out emails incorrectly addressed or categorized by users, or emails with insufficient information.
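The address-and-subject routing described above can be sketched in a few lines of Python. This is a minimal illustration, not the mechanism of any particular ITSM tool: only the SAP-HR@company.com address comes from the text, while the group names, the default queue and the function name `route_email` are assumptions made for the example.

```python
import re

# Hypothetical routing table: mailbox address -> support group.
# Only SAP-HR@company.com comes from the text; group names are illustrative.
ADDRESS_ROUTES = {
    "sap-hr@company.com": "SAP HR support",
}

# Hypothetical subject-line tags that route to the same groups.
SUBJECT_ROUTES = {
    "SAP-HR": "SAP HR support",
}

# Assumed fallback queue for emails that match no rule.
DEFAULT_GROUP = "Service desk (manual triage)"


def route_email(to_address: str, subject: str) -> str:
    """Return the support group for an email: address first, then subject tag."""
    group = ADDRESS_ROUTES.get(to_address.strip().lower())
    if group:
        return group
    for tag, tagged_group in SUBJECT_ROUTES.items():
        # Match the tag as a whole token, ignoring case.
        if re.search(rf"(?<!\w){re.escape(tag)}(?!\w)", subject, re.IGNORECASE):
            return tagged_group
    return DEFAULT_GROUP
```

Note that an email about a broken printer sent to SAP-HR@company.com would still be routed to the SAP HR group: the rule looks only at the address and the subject tag, never at the actual content, which is exactly the limitation described above.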

In general, email as a support channel has limited applicability, and often requires a call back from the service provider, which brings us back to phone support, just with a slower and more complicated queueing system. The main reason for this is a lack of ability to automatically process the natural language of emails and make sense of them.

This issue is easier to solve with portals, where guiding forms and templates with limited options help to delegate a significant part of the classification and data collection work to users. However, users may not like this experience, and the quality of the information provided is not always sufficient.

Figure 2.1 What can we do via different user support channels

To increase the quality of the data collected, service providers often burden users with complicated and lengthy ticket forms, thus putting user satisfaction at risk. In the worst scenarios, users don’t bother to fill in the form and turn to some unsupported method of contact that cannot be properly tracked and used to provide data.

In this situation, automated classification based on natural language text may become a solution that will improve user experience with portals and increase the applicability of email as a support channel.

Machine learning: implementation approach

So, we have decided to automate ticket classification based on the natural language of the tickets. To do this, we needed to create:

  • a dataset to teach the system
  • a set of categories and rules to be used by the system
  • a classification model.

Figure 3.1 Key implementation steps and results

STEP 1. PREPARING THE LEARNING DATA SET

One of the key pre-requisites for machine learning is a dataset of sufficient size and quality. In this case, this means a database of user tickets covering the full range of topics, routing options, categories and expressions that are used to describe issues and requests, including acronyms and acceptable jargon. In a multi-lingual environment, this should be available in all languages in use.

It is vital for machine learning to have an initial data set of high quality. This is impossible if the quality of records is neglected in the current practice of the organization. No machine can learn valuable rules for future data processing based on incomplete and incorrect past data.

One data-quality issue we faced with user tickets was caused by a lack of re-categorization. A ticket is usually assigned to a specialist group based on its category (an output of classification). If this group re-assigns the ticket to another group, the ticket is rarely re-categorized, so many tickets end up with a wrong ‘category-group’ combination, which may confuse the classification system or even teach it incorrect rules. Note that this is a process issue, not a technical one. Sometimes, to create and collect a dataset that is good enough, you need to improve your processes first. These improvements are not just for initial learning; they also help to continually feed useful data into the classification system, and so improve it. That is why cleaning past data is necessary but not sufficient. In our example, improvements were needed in the procedures for manual classification and in the categories catalogue.
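One way to find tickets with a suspicious ‘category-group’ combination is to compare each ticket’s category with the group that actually resolved it. The sketch below assumes a simple catalogue that maps every category to one owning group; the `Ticket` fields, the group names and the function name are hypothetical and are not taken from the project described here.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    ticket_id: str
    category: str     # category set at initial classification
    resolved_by: str  # group that actually resolved the ticket


# Hypothetical catalogue: which group owns each category.
CATEGORY_OWNER = {
    "password reset": "Identity team",
    "mailbox creation": "Messaging team",
}


def split_by_consistency(tickets, category_owner):
    """Separate tickets whose category matches the resolving group from the rest."""
    consistent, suspect = [], []
    for ticket in tickets:
        if category_owner.get(ticket.category) == ticket.resolved_by:
            consistent.append(ticket)
        else:
            # Probably re-assigned without re-categorization: review before training.
            suspect.append(ticket)
    return consistent, suspect
```

The suspect tickets are candidates for manual review or exclusion from the learning data set. A check like this only cleans past data; it does not replace the process improvement discussed above.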

STEP 2. TEACHING THE SYSTEM

Once a data set has been prepared, there are two main approaches to using it to teach the system:

  • ‘learning with a teacher’, when categories and rules for classification are created or suggested by a human. For example, we can tell the system that all messages including such expressions as ‘SAP HR’, ‘leave request’, ‘mandatory training’, ‘maternity leave’ or ‘resignation’ should follow the process defined for SAP-HR support.
  • ‘learning without a teacher’, when categories and/or rules are derived by the system based on the trends found in the data set. For example, the system can find that the majority of tickets with the expressions listed in the first point were solved by the SAP-HR expert group in the past, so it can suggest or create a rule for future routing of all new tickets with the same expressions.

Either (or both) of these approaches can be used to create the classification model. Initial teaching is iterative, with multiple cycles including testing, detection of bottlenecks, refinement of the rules and new testing. It is also important to make the improvement of the model a continual process, as it will be developing based on new data. Similar to humans, it is likely that the system will require a teacher for initial learning, and, if successful, will continue to learn without one.
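The second approach, in which rules are derived from trends in past tickets, can be illustrated with a simple counting sketch. It is a deliberately simplified stand-in for a real learning algorithm: for each expression, it finds the group that resolved most of the matching tickets and proposes a routing rule when the majority is large enough. The function name, the thresholds `min_share` and `min_tickets`, and the sample data format are assumptions made for the example.

```python
from collections import Counter, defaultdict


def derive_routing_rules(history, expressions, min_share=0.75, min_tickets=3):
    """Propose expression -> group rules from past tickets.

    history: iterable of (ticket_text, resolving_group) pairs.
    A rule is proposed only if at least `min_tickets` tickets contain the
    expression and at least `min_share` of them were resolved by one group.
    """
    counts = defaultdict(Counter)
    for text, group in history:
        lowered = text.lower()
        for expression in expressions:
            if expression.lower() in lowered:
                counts[expression][group] += 1

    rules = {}
    for expression, per_group in counts.items():
        total = sum(per_group.values())
        group, hits = per_group.most_common(1)[0]
        if total >= min_tickets and hits / total >= min_share:
            rules[expression] = group
    return rules
```

Rules proposed this way can be reviewed by a human before they are activated, which combines both approaches in the iterative manner described above.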

STEPS 3 AND 4. TESTING AND ANALYSIS

After the system is taught, it needs to be tested. As with most tests, this should use representative sets of tickets to ensure that the classification predicted by the system is correct, with an acceptable level of accuracy for all types and forms of tickets. It is a good idea to set up thresholds for ‘acceptable accuracy’. For example, we decided that some tickets may be routed automatically, whereas others may require manual confirmation. Lastly, where the level of accuracy does not allow for an exact classification, the system may suggest a narrowed list of options for manual selection (see Fig. 3.2). In some cases, the next step of the support process may be to request more information from users by means of a specialized form. This could still be an improvement in both user experience and data quality depending on how relevant the form is.
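The threshold logic described above can be expressed as a small decision function. The numeric thresholds below are purely hypothetical (the values used in the project are not given in the text), as are the action names and the shortlist size.

```python
# Hypothetical thresholds; tune them against your own test results.
AUTO_ROUTE_THRESHOLD = 0.90
CONFIRM_THRESHOLD = 0.60


def decide_action(ranked, shortlist_size=3):
    """Choose how to handle a ticket from its predicted categories.

    ranked: list of (category, confidence) pairs, highest confidence first.
    Returns (action, categories_to_use).
    """
    if not ranked:
        return ("manual", [])
    top_category, top_confidence = ranked[0]
    if top_confidence >= AUTO_ROUTE_THRESHOLD:
        # Confident enough to route without human involvement.
        return ("auto_route", [top_category])
    if top_confidence >= CONFIRM_THRESHOLD:
        # Show the prediction and ask a person to confirm it.
        return ("confirm", [top_category])
    # Not confident: offer a narrowed list for manual selection.
    return ("suggest", [category for category, _ in ranked[:shortlist_size]])
```

Keeping the thresholds as named constants makes it easy to tighten or relax them as testing shows how reliable the predictions are for each type of ticket.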

Testing may highlight issues with the learning dataset, classification rules, or both. For example, similar requests may be categorized incorrectly by both humans and the system due to vague and ambiguous categories. We experienced this with such categories as ‘password reset’ (often confused with ‘block or unblock account’) and ‘mailbox creation’ (confused with ‘new account creation’). It is vital to clear issues like these before starting operational use of the system, as incorrect classification at the start of the process will compromise the credibility of the approach in general. This could lead to people avoiding use of the system in the future, or continuing to re-confirm classifications manually even after the system becomes sufficiently correct.
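Ambiguous categories such as these tend to show up clearly when test results are broken down by category. The following sketch, which assumes that test results are available as pairs of expected and predicted categories, computes the accuracy per category and lists the most frequent confusions; the function name and data format are illustrative.

```python
from collections import Counter


def evaluate(pairs):
    """Summarize test results.

    pairs: iterable of (expected_category, predicted_category).
    Returns (accuracy_by_category, confusions), where confusions is a list of
    ((expected, predicted), count) sorted by descending count.
    """
    totals, correct, confusions = Counter(), Counter(), Counter()
    for expected, predicted in pairs:
        totals[expected] += 1
        if expected == predicted:
            correct[expected] += 1
        else:
            confusions[(expected, predicted)] += 1
    accuracy = {category: correct[category] / totals[category] for category in totals}
    return accuracy, confusions.most_common()
```

A pair of categories that dominates the confusion list is a signal to review the categories catalogue, not only the model.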

Figure 3.2 Example thresholds for use of automated classification

Machine learning: the architecture behind the magic

We keep referring to the classification engine as a ‘system’ in this paper, but in a real-life situation it is important to understand exactly what the system is. As mentioned at the beginning of this paper, many ITSM vendors offer this or similar functionality within their ITSM tools, and users should always check with their vendor or consultant before attempting to create a system of their own. In our case, no out-of-the-box solution was available, so we had to look at the open market. Fortunately, machine learning here has already achieved a state where minimum expert knowledge is required to create a working solution. These days this may not be as easy as building a simple website, but it is still a useful analogy. As with building a website, technical knowledge is less important than a good understanding of the architecture and business logic required. There is no need to create a machine learning engine, as these are already available, often for free.

Essentially, there are a few key components that are needed to enable a machine learning-based approach for ITSM automation:

  • Your ITSM tool should be able to exchange data with external systems (most of them are)
  • There should be a sufficient volume of high-quality data to teach the system (this is usually a challenge)
  • You will need a neural network computation library for machine learning, such as TensorFlow3, CNTK4, or Theano5
  • If you prefer not to delve into the technical aspects of the library, consider adding a high-level API to your architecture, such as Keras6 or TFLearn7 (we used Keras running on top of TensorFlow in our project)
  • You will still need some rules to process the outputs of the neural network. For example, the rules illustrated in Fig. 3.2 are applied in the ITSM tool based on the analysis of veracity of the predicted classification.
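As an illustration of the last point, the sketch below shows one possible first step in processing a network’s outputs: turning a list of raw scores, one per category, into probabilities ranked from most to least likely. It uses only the Python standard library and is not the code from our project. Whether this step is needed at all depends on the model; a Keras model whose final layer uses a softmax activation already returns probabilities. The function name and the sample categories are assumptions.

```python
import math


def rank_categories(scores, categories):
    """Convert raw output scores into (category, probability) pairs, best first."""
    if len(scores) != len(categories):
        raise ValueError("scores and categories must have the same length")
    if not scores:
        return []
    # Softmax, shifted by the maximum score for numerical stability.
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    probabilities = [value / total for value in exps]
    return sorted(zip(categories, probabilities), key=lambda pair: pair[1], reverse=True)
```

The ranked list can then be passed back to the ITSM tool, where rules decide what to do with the top prediction.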

3 https://www.tensorflow.org/
4 https://github.com/Microsoft/cntk
5 https://github.com/Theano/Theano
6 https://keras.io/
7 http://tflearn.org/

Machine learning: continual improvement

The architecture and approach described here are universally applicable to many areas of automation in ITSM. Within AIOPS, they may apply to incident resolution (finding solutions in the knowledge base, or even service self-healing), early problem detection and trend analysis, evaluation of change impact for proposed and implemented changes, and many other areas. Depending on the context, source data and classification rules will differ, as well as the resulting actions, but the key principles will remain the same:

  • focus on value: machine learning is a tool, not the goal
  • start where you are: consider data quality
  • progress iteratively with feedback: machine learning should be continual
  • keep it simple and practical: user experience will define the adoption
  • collaborate and promote visibility: ensure sufficient testing and awareness
  • think and work holistically: understand how the approach applies to the organization in general
  • optimize and automate: this is the main result that machine learning is trying to achieve.

ITIL principles guide any improvement initiative, including the ones aiming to automate service management. For practical recommendations on this matter, refer to ITIL Guiding Principles for Continual Improvement8. ITIL also addresses the growing role of intelligent automation in service management in many other ways. Refer to descriptions of ITIL practices9 for automation recommendations and check section 3.2 of ITIL Foundation for an overview of the information and technology dimension of service management.

8 https://www.axelos.com/guiding-principles
9 Available from the second half of 2019

About the authors

Evgeny Shilov, Director of Consulting at Cleverics. Experienced ITSM practitioner and consultant with a blend of technical and management expertise. ITIL Expert. https://www.linkedin.com/in/evgeniyshilov/

Cleverics, education and consulting in IT Management. 900+ companies and 15 000+ delegates have already benefited from our knowledge and experience. https://cleverics.com

Roman Jouravlev, Portfolio development manager at AXELOS, lead architect of ITIL 4. ITIL Expert. https://www.axelos.com/news/blogs/june-2017/meet-the-itil-team-roman-zhuravlev
