Problem management at the DWP
- Case Study
- Problem management
May 15, 2021 | 14 min read
The UK government’s Department for Work and Pensions has been improving its problem management processes for years. It has used ideas from a range of service management frameworks, including ITIL 4, to optimize its ways of working and cultivate an effective problem management culture.
Read this case study to find out what changes they’ve made, and why.
1.1 The Department for Work and Pensions
The Department for Work and Pensions (DWP) is the UK’s biggest public service department. It administers benefits to around 20 million claimants and customers. This involves assessing claims, making and measuring benefits payments, and staffing job centres and benefit centres. All these activities depend on various forms of IT. The DWP uses about 300 different technical services, each of which is managed by a dedicated service owner. Service owners act as a liaison between customers and operations, and they can call on technical teams as needed.
Service owners and technical teams are also responsible for problem management. They have the knowledge and expertise needed to investigate, diagnose, build, and implement fixes for problems. Technical teams investigate and present options to service owners on possible solutions to problems; service owners make the decisions.
1.2 The digital problem management team
Historically, the majority of the DWP’s IT service provision was outsourced to external suppliers. Service management was also outsourced, and problem management was part of a service integration and management (SIAM) function that acted as a layer between the DWP and external suppliers.
Then in 2015, the DWP changed the way it manages IT services. The various outsourced supplier contracts were progressively insourced. However, the overall approach to problem management remained largely similar to the pre-2015 approach. External supplier technical teams became the DWP internal technical teams. SIAM problem management became the DWP digital problem management team (DPMT), and the DWP live service support teams evolved under the direction of a service owner.
Now, the DPMT generally does not manage individual problems. Rather, the team oversees the whole process to ensure that problems are raised and managed effectively. The DWP estate is extremely large and diverse: there are many services, and many of those include distinct strands (for example, within user-facing functions such as child maintenance and working age benefits). It makes sense, therefore, that the experts for each service manage their own problems.
“Organizations should understand the errors in their products because they may cause incidents and affect service quality and customer satisfaction. The problem management practice ensures problem identification and thus contributes to the continual improvement of products and services.” ITIL problem management practice guide, section 2.4.1
Sometimes, the DPMT will intervene in the management of specific problems, such as those having a major impact, those that need more traction and focus, or those that are difficult to diagnose. The DPMT might also facilitate cooperation between several service owners and technical teams who are trying to diagnose a problem.
“When problems have been identified, they should be handled effectively and efficiently. It is rarely possible to fix (remove) all the problems in an organization’s products, but identification without resolution is significantly less valuable for the organization and its customers.” ITIL problem management practice guide, section 2.4.2
The service owners and technical teams who manage problems also manage incidents, changes, knowledge, and so on. From a practical perspective, they have full end-to-end control over their services and the processes that support them. This helps to drive quality; for example, it is in service owners’ interests to ensure that any changes are implemented safely because they will have to manage the consequences of any unsafe change.
The evolving demands on problem management
The role of problem management has been evolving rapidly as new technologies emerge and ways of working change. The main driver for this evolution at the DWP was the insourcing strategy and the decision to move away from external suppliers. The strategy was to develop in-house digital services and at the same time to take advantage of emerging technologies, such as cloud services.
This change in strategy, the increasing use of agile methodologies, and introduction of site reliability engineering at the DWP provided an opportunity to evolve the processes and to reaffirm the value of problem management within the organization.
So, there has been a fundamental shift in the way services are managed. For example, pre-2015, external suppliers had to abide by detailed problem management policies and procedures, which were agreed between SIAM problem management and the DWP.
The policies and procedures tended to be very prescriptive; they described the process in detail and included statements about what suppliers must or must not do. An escalating scale of penalties for repeated non-compliance was included in suppliers’ contracts.
Post-2015, the insourced suppliers are not governed by commercial contracts. Instead, they have an internal obligation to follow the policies and procedures. There are no penalties for non-compliance. This meant that the DPMT had to redesign their interactions with the technical teams, service owners, and suppliers.
One of the first things the DPMT did was re-evaluate the purposes and efficacy of the policies and procedures (in part prompted by engagement with agile development teams). The team reworked the policies and procedures into a framework document that provided the minimum documentation necessary.
“Many problem management activities rely on the knowledge and experience of staff, rather than on following detailed procedures. People responsible for diagnosing problems often need the ability to understand complex systems, and to think about how different failures might have occurred. Developing this combination of analytical and creative ability requires mentoring and time, as well as suitable training.” ITIL® Foundation: ITIL 4 Edition, section 5.2.8
The new framework documentation concentrates on communicating the value of problem management, the key guiding principles, and the team’s aims and objectives and how to achieve them. Prescriptive ‘must’ statements were removed. In all, the team reduced 11 process documents down to a single framework document.
The new framework provides just enough information that stakeholders can understand the principles of the process and implement it in their teams.
To meet the need for more detailed information and to provide additional guidance and support, knowledge articles have been created in the service management toolset. All the information that was in the policy and procedure documents is still available, but it is now easier to locate and kept separate from the essential framework.
One of the implications of the new framework is that those following the framework must have a good understanding of the role of problem management to be successful. As such, the DPMT’s role is evolving from one of enforcement (pre-2015) to one of training and upskilling (post-2015).
Problem management in the DWP
3.1 Service management methodologies
The DWP bases its service management upon ITIL® principles but recognizes that other methodologies can help support the delivery of services. These include agile methodologies and site reliability engineering (SRE), which provide frameworks for organizations to consider or adopt.
“Most of the content of the practice guides should be taken as a suggestion of areas that an organization might consider when establishing and nurturing their own practices.” ITIL problem management practice guide, section 7
The key to getting the most out of these methodologies is staff awareness. The DWP recognizes the need to have a fully trained and skilled workforce and fully supports training and development opportunities in these areas. All methodologies have their strengths, and inspiration can be drawn from all of them.
However, awareness alone is not enough. The team is encouraged to utilize their learning and translate theory into practice.
The DPMT do this by recognizing that methodologies are fundamental underpinnings of what is done: they do not specify how it is done in detail. By understanding the implied objectives of the methodologies and focusing on what can be improved, the DPMT have been able to evolve their processes and develop specific improvements, such as real-time root cause analyses (RCAs).
The ability to evolve, improve, and add value is due to the commitment of the team and a deliberately cultivated positive environment in which employees are encouraged to listen to stakeholders, try new ideas, and learn continually.
3.2 Training and awareness
The DPMT conducted a survey to find out what the problem management stakeholders thought of problem management, as well as what skills and training they needed. The survey made it clear that many people were confused about the practical application of theories. Many people also wanted to learn more about specific topics, such as root cause analysis techniques.
As a result, the DPMT has developed and continues to develop a range of training opportunities for stakeholders. Some of the most useful delivery methods are:
- Bitesize sessions: These focus on a specific topic, such as ‘what is a problem?’ They were originally part of a dedicated ‘learning at work’ week, but they have since been turned into videos that are always available.
- Face-to-face sessions: These sessions are useful for complex topics, and they have the added benefit of enabling direct contact with stakeholders.
- Knowledge articles: These are an ‘always there’ resource that allows people to self-help. However, sometimes they are best used to support more direct training. Articles provide targeted information, such as advice on specific parts of the process.
3.3 The problem process
The problem process in the DWP follows ITIL principles. The team objectives include minimizing the number of incidents (actual and potential) and their impact, such as through workarounds.
ITIL defines a problem as: “A cause, or potential cause, of one or more incidents.”
In the DWP, problems are raised for IT faults and for people and process issues. The focus is on improving and plugging gaps in processes, identifying training opportunities, disseminating information, and applying technical fixes.
A general principle is that problems are raised not only when the root causes of an incident are unknown, but also when the root causes are known and further actions are needed to fix or mitigate a risk. Problems should be documented so that incidents can be linked to them to help assess their overall impact. In fact, there are very few circumstances when a problem would not be raised if incidents are being generated.
A problem record promotes the visibility of the issue and allows it to be managed in a single space. Without the problem record, issues would be overlooked. Decisions and reasoning can be recorded in the problem record and revisited at a later date, regardless of whether or not a problem is fixed.
Key message: Problems are positive
Without a problem record, issues are never resolved, meaning incidents are never minimized or eradicated.
3.3.1 The problem lifecycle
Problems in the DWP progress through a problem management lifecycle. The lifecycle has six states:
- under investigation
- investigation complete
- fix identified
- fix scheduled
- fix implemented
- fix successful.
The lifecycle allows the team to assess how long it takes to complete an investigation or implement a fix, which means they can identify bottlenecks in the process. Targeted actions can then address process failings. Workarounds or mitigating actions can be introduced at any stage in the lifecycle and will be reassessed and improved as knowledge of the problem develops.
Ideally, all problems would progress through the lifecycle to the ‘fix successful’ stage. However, in many cases the team will decide early on not to progress a problem. In these cases, the problem is closed early.
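As a rough illustration of how time spent in each lifecycle state could be used to spot bottlenecks, here is a minimal Python sketch. It is not the DWP's tooling: the record layout, the function names, and the sample dates are all hypothetical, and a real service management toolset would supply the state transitions.

```python
from collections import defaultdict
from datetime import datetime

# The six lifecycle states named in the case study, in order.
LIFECYCLE = [
    "under investigation",
    "investigation complete",
    "fix identified",
    "fix scheduled",
    "fix implemented",
    "fix successful",
]


def days_in_each_state(transitions):
    """Return days spent in each state, given (state, entered_at) pairs in time order.

    The last state in the list has no exit time yet, so it is not counted.
    """
    durations = {}
    for (state, entered), (_, left) in zip(transitions, transitions[1:]):
        if state not in LIFECYCLE:
            raise ValueError(f"unknown lifecycle state: {state}")
        durations[state] = (left - entered).days
    return durations


def slowest_state(problems):
    """Return (state with the highest average duration, all averages)."""
    samples = defaultdict(list)
    for transitions in problems:
        for state, days in days_in_each_state(transitions).items():
            samples[state].append(days)
    averages = {state: sum(d) / len(d) for state, d in samples.items()}
    return max(averages, key=averages.get), averages


# Hypothetical records: (state, date the problem entered that state).
problem_a = [
    ("under investigation", datetime(2021, 1, 4)),
    ("investigation complete", datetime(2021, 1, 11)),
    ("fix identified", datetime(2021, 1, 13)),
    ("fix scheduled", datetime(2021, 2, 12)),
    ("fix implemented", datetime(2021, 2, 15)),
    ("fix successful", datetime(2021, 2, 22)),
]
# A problem that has not yet progressed past 'fix scheduled'.
problem_b = [
    ("under investigation", datetime(2021, 3, 1)),
    ("investigation complete", datetime(2021, 3, 4)),
    ("fix identified", datetime(2021, 3, 5)),
    ("fix scheduled", datetime(2021, 3, 25)),
]

bottleneck, averages = slowest_state([problem_a, problem_b])
print(bottleneck)
print(averages)
```

With this made-up data, the longest average wait is between identifying a fix and scheduling it, which is the kind of signal that could prompt a targeted process action.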
“The problem management practice includes the following practice success factors:
- identifying and understanding the problems and their impact on services
- optimizing problem resolution and mitigation.” ITIL problem management practice guide, section 2.4
3.3.2 Closing problems
Problems are raised and worked on until the service owner decides to close them. Reasons for closure include:
- The problem has been fixed and will not reoccur.
- Action has been taken to mitigate the problem’s impacts in the form of a workaround.
- Some other action has been implemented, such as improved training or a process improvement.
- No action is taken; the problem is accepted as a risk.
The reason for closure is recorded using a closure code. However, closed problems are still important because incidents may still be linked to them.
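The closure reasons above could be modelled as a small set of codes on the problem record. The following Python sketch is an assumption for illustration only: the `ClosureCode` names, the `ProblemRecord` fields, and the reference formats are hypothetical and do not come from the DWP's toolset.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class ClosureCode(Enum):
    """Hypothetical codes mirroring the four closure reasons listed above."""
    FIXED = "fixed and will not reoccur"
    WORKAROUND = "impact mitigated by a workaround"
    OTHER_ACTION = "other action, such as training or a process improvement"
    RISK_ACCEPTED = "no action taken; accepted as a risk"


@dataclass
class ProblemRecord:
    reference: str
    closure_code: Optional[ClosureCode] = None
    linked_incidents: List[str] = field(default_factory=list)

    @property
    def is_closed(self) -> bool:
        return self.closure_code is not None

    def close(self, code: ClosureCode) -> None:
        """Close the problem and record why."""
        self.closure_code = code

    def link_incident(self, incident_ref: str) -> None:
        # Linking is allowed whether the problem is open or closed,
        # so the impact of an accepted risk stays visible.
        self.linked_incidents.append(incident_ref)


record = ProblemRecord("PRB-0001")
record.close(ClosureCode.RISK_ACCEPTED)
record.link_incident("INC-0042")  # a later incident linked to the closed problem
print(record.is_closed, record.closure_code.name, record.linked_incidents)
```

Keeping `link_incident` independent of the closed state reflects the point that closed problems still matter: the number of incidents linked after closure shows whether the closure decision should be revisited.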
3.4 Reporting on problem management KPIs
Defining effective KPIs is very difficult; focusing on individual KPIs does not provide enough context about the performance of the team, as the following example illustrates.
At one time in the DWP, suppliers had to propose specific problem fixes within a certain amount of time after raising a problem. The intention was to encourage suppliers to propose fixes quickly. In practice, teams would delay raising a problem record until a fix had been identified. The problem record would then be raised and the KPI would be met.
“In many cases, using single-system-based metrics as targets can result in misalignment and a disconnect between service partners regarding the success of the service delivery and the user experience […] This is referred to as the ‘watermelon SLA’ effect.” ITIL® Foundation: ITIL 4 Edition
The lesson is that focusing on metrics and managing by numbers can be very dangerous. Negative behaviours can develop, undermining the original intention to improve the process.
Now, the DPMT treats KPIs cautiously and avoids arbitrary targets. They also balance complementary KPIs without placing overreliance on any one. The problem record is emphasized as a good thing; a low number of problems is not a measure of success if the number of incidents remains high and incident resolution is low.
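To make the idea of complementary KPIs concrete, the sketch below pairs the supplier-era measure (time from raising a problem to proposing a fix) with a second measure (time from the first incident to raising the problem). Both measure names and all dates are hypothetical; the case study does not say which KPIs the DPMT pairs.

```python
from datetime import date


def kpi_pair(first_incident, problem_raised, fix_proposed):
    """Return (days_to_raise, days_to_propose_fix) for one problem record."""
    return (
        (problem_raised - first_incident).days,
        (fix_proposed - problem_raised).days,
    )


# Problem raised promptly; the fix is proposed after investigation.
prompt = kpi_pair(date(2021, 3, 1), date(2021, 3, 2), date(2021, 3, 20))

# Problem raised only once the fix was already known.
delayed = kpi_pair(date(2021, 3, 1), date(2021, 3, 19), date(2021, 3, 20))

print(prompt)   # (1, 18)
print(delayed)  # (18, 1)
```

Judged on days-to-propose-fix alone, the delayed record looks far better, even though both problems took 19 days from first incident to proposed fix. Reading the two measures together exposes the behaviour that the single KPI rewarded.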
“Metrics will be based on the overall service strategy and priorities of an organization, as well as on the goals of the value streams to which the practice contributes.” ITIL problem management practice guide, section 2.5
3.5 Real-time root cause analysis
The approach taken towards root cause analyses at the DWP is another example of how the ways of working have changed since 2015.
Previously, when a major incident occurred, the teams responsible were given a root cause document to complete. Completing this document was seen as a punishment for the incident.
Now, all parties involved in a major incident attend a post-incident review in which root cause analysis is performed in real time. The main advantages of this method are:
- There is a no-blame culture: the objective of the exercise is to identify root causes and what can be done to prevent or mitigate them. No one is interested in assigning blame.
- Participants are more willing to offer information and facilitate the process, even if that means owning up to mistakes.
- The quality of the resulting document is noticeably higher.
- The time taken to complete the document has been reduced from weeks to one or two days.
“Where reporting identifies failures or weaknesses, these must be viewed as opportunities for improvement. Employees who know they will be blamed when measurements uncover undesirable results will often mask problems, making it very difficult for the organization to correct them.” ITIL® 4: Direct, Plan and Improve, section 4.2.1
To perform root cause analyses in real time, you need:
- someone conversant with the five-whys methodology to chair and guide the analysis
- a scribe to document the analysis
- a way of sharing the live analysis document as it is being written.
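The five-whys technique works by repeatedly asking why the previous answer happened until a candidate root cause emerges. A scribe's live record of that chain could be sketched as below. The `FiveWhys` class, its method names, and the example incident are hypothetical and are not taken from the DWP's review template.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FiveWhys:
    """A running record of a five-whys analysis, as a scribe might keep it."""
    incident: str
    chain: List[Tuple[str, str]] = field(default_factory=list)

    def ask(self, answer: str) -> None:
        """Record the answer to 'why?' asked of the previous statement."""
        previous = self.chain[-1][1] if self.chain else self.incident
        self.chain.append((f"Why: {previous}?", answer))

    @property
    def candidate_root_cause(self) -> str:
        """The most recent answer, or an empty string if nothing is recorded."""
        return self.chain[-1][1] if self.chain else ""

    def render(self) -> str:
        """Return the analysis as plain text for a shared live document."""
        lines = [f"Incident: {self.incident}"]
        for number, (question, answer) in enumerate(self.chain, start=1):
            lines.append(f"{number}. {question}")
            lines.append(f"   Because: {answer}")
        return "\n".join(lines)


# Entirely made-up example content.
rca = FiveWhys("the claims service was unavailable")
rca.ask("the database ran out of disk space")
rca.ask("log files were not being rotated")
rca.ask("the rotation job was missed when the server was rebuilt")
print(rca.render())
```

The chair decides when the chain has gone deep enough; five is a guide rather than a rule, so the sketch does not enforce a fixed number of questions.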
3.6 Problem communities
The concept of problem communities illustrates how far problem management has evolved. Each service has a number of stakeholders who are interested in its performance. Stakeholders include customers, other service owners, technical teams, and even the service desk (who have to deal with the fallout of problems).
These stakeholders need to know about any problems associated with that service and what, if any, impact those problems will have on them. This information could be shared in a report, but problem communities go further. At the DWP, when a problem is raised or reaches a critical point in the lifecycle, its impact is communicated to the problem community stakeholders, who are invited to respond with any questions or queries, which are then answered.
The service owner responsible for problem management can then draw upon diverse stakeholder expertise; for example, asking ‘is this workaround acceptable?’
3.7 Relationships with other process areas
The DPMT does not operate in isolation. The team has a mature set of processes that have grown and developed on two levels: there is a relationship between key process teams, such as the incident, change, and knowledge management teams; there is also a relationship between the various service owners and technical teams.
3.7.1 Key process teams
The DPMT works very closely with the incident management team to ensure the two processes are complementary and support each other. For example, the teams hold a monthly incident and problem review forum that allows the incident and problem stakeholders to discuss and resolve any issues involving the two processes. This encourages information sharing and collaboration within the direct community.
“Relationship management seeks to harmonize and synergize different organizational relationships with internal and external customers to realize targeted benefits through continual improvement.” ITIL® Foundation: ITIL 4 Edition, section 5.1.9
Knowledge has always been a part of the DPMT’s activities, but the creation of a dedicated knowledge management team helped to formalize how and when knowledge is created. The consideration of knowledge is now integral to each stage of the problem lifecycle.
The DPMT also has close relationships with other areas, such as change management, to be able to manage problem fixes and also identify changes that introduce new problems. Another example is service transition, because it is important that new services have full problem support in place from the beginning.
Much of what has been introduced and developed in the problem management space at the DWP is the result of months or years of effort and change. Ideas and ways of working such as real-time root cause analyses and the problem lifecycle have benefitted from the team’s perseverance and the open and collaborative culture.
The DPMT have been involved in problem management for several years. They have built their expertise and strongly believe in the value of problem management. They regularly identify opportunities to evolve and improve.
Problem management at the DWP has come a long way. Upskilling and promoting problem management within the wider organization is a key part of the team’s strategy. Giving stakeholders the skills and knowledge they need to understand the benefits of problem management means that the process as a whole is carried out more effectively.
KPIs and measures are not the team’s primary focus. Although such measures exist, the focus is on actually managing problems for the right reasons. Positive results follow naturally from this approach. The main takeaway is that problems are positive!
About the authors
Rosie has been a service management professional for over 15 years in the public and private sector.
Rosie has worked in various IT and operational roles within the DWP and has been the DWP problem process owner for the last 10 years.
Stephen has been a problem management team leader with the DWP for the last 13 years. Stephen has over 20 years’ IT experience spanning academia and operational/IT roles in the public/private sector.
Both Rosie and Stephen have ITIL experience stretching back to ITIL v2 and have recently completed the ITIL 4 Managing Professional transition.