2.145.2 Availability Management Process Description

Manual Transmittal

September 29, 2014

Purpose

(1) This transmits revised IRM 2.145.2, Availability Management, Availability Management Process Description.

Material Changes

(1) This transmittal establishes the initial IRM publication for the Availability Management process.

Effect on Other Documents

IRM 2.145.2 dated April 22, 2014, is superseded.

Audience

The Availability Management Process is applicable to all ACIO areas with responsibility for Availability Management.

Effective Date

(09-29-2014)

Terence V. Milholland
Chief Technology Officer

Process Description

  1. The Availability Management process translates business needs and plans into requirements for service and IT infrastructure, ensuring that the future business requirements for IT services are quantified, designed, planned, and implemented in a timely fashion.

Introduction

  1. Availability Management

Administration
  1. All proposed changes to this document should be directed to the Chief, Solution Engineering Process Maturity, owner of this process description (PD), and be pursued via the IPM process to clearly define interfaces, roles, and responsibilities and to coordinate participation and collaboration among stakeholders.

Purpose of Process Description
  1. This Availability Management process description describes what happens within the Availability Management process and provides an operational definition of the major components of the process. This description specifies, in a complete, precise, and verifiable manner, the requirements, design, and behavior characteristics of the Availability Management process. The PD is a documented expression of a set of activities performed to achieve a given purpose. Tailoring of this process to meet the individual needs of each project is covered in the Tailoring Guidelines section of this document.
    For the purpose of this document, roles such as Availability Manager, Service Owner, Availability Management Team Member, Business Owner, IT Operations Manager, and Higher Authority are provided to describe a set of responsibilities for performing a particular set of related activities. Roles and/or responsibilities should fit your business terminology.
    The process managers and practitioners for the execution of this process are dispersed across several ACIO areas based primarily on the platforms the process impacts. All procedures for this process are local procedures that detail how the process steps are carried out for the particular platform the local procedures address. All local procedures are required to reference this IRM as being the process description that the procedure is written for. Local procedures are required to address the process steps in this IRM and conform to IRM 1.11.2.2.1.

Document Overview
  1. This document describes a set of interrelated activities which transform inputs into outputs to achieve a given purpose and states the guidelines that all projects should follow regarding the Availability Management process. The format and definitions used to describe each of the process steps of the Availability Management process are described below:

    • Purpose – The objective of the process step.

    • Roles and Responsibilities – The responsibilities of the individuals or groups for accomplishing a process step.

    • Entry Criteria – The elements and conditions (state) necessary to trigger the beginning of a process step.

    • Input – Data or material needed to perform the process step. Input can be modified to become an output.

    • Process Activity – The list of activities that make up the process step.

    • Output – Data or material that are created (artifacts) as part of, produced by, or resulting from performing the process step.

    • Exit Criteria – The elements or conditions (state) necessary to trigger the completion of a process step.

Process Overview

  1. Process Overview

Work Products
Work Products Used by This Process (Inputs)
  1. The following work products are used to assist in the implementation of the Availability Management process:

    • Business Availability Requirements

    • Business Impact Analysis

    • Identified Vital Business Functions

    • Availability, Reliability & Serviceability/Maintainability Requirements

    • Incident and Problem Data

    • Configuration and Monitoring Data

    • Agreed targets for Availability, Reliability and Maintainability

Work Products Produced by This Process (Outputs)
  1. The following work products (artifacts) are produced by the Availability Management process and may be used as inputs to other processes:

    • IT Infrastructure Resilience and Risk Assessment

    • Measures of availability and agreed-on availability targets

    • Monitoring & Reporting of Availability, Reliability and Maintainability

    • Updated Availability Plan

    • Availability Improvement Plans

    • System Recovery Strategies

    • Unavailability Analysis reports

    • Recommendations for changes to the Availability Plan

Roles and Responsibilities
  1. Many roles are involved in the Availability Management process. This section defines the roles used throughout this document in terms of their responsibilities.

  2. Roles and Responsibilities

    Role Description Definition of Responsibility
    Availability Manager
    • Primary source of management information on the Availability Management process

    • Overall responsibility for enhancing availability within dictates of sound return on availability investment decisions

    • Manages relationships between customers and Service Level Management (SLM)

    • Provision and operation of tools to support performance reporting on an ongoing basis

    • Ensure customer satisfaction with process

    • Ensure that business and stakeholders are involved in collecting availability requirements

    Service Owner
    • Deliver a particular service within the agreed service levels

    • Act as the counterpart of the Service Level Manager when negotiating Operational Level Agreements (OLAs)

    • Lead technical specialists and internal support units

    Availability Management Team Member
    • Provide technical expertise on availability of specific classes of Configuration Items (CIs)

    • Keep aware of availability characteristics of new products

    • Provide Tier III support as required in restoration of services following Major Incidents

    • On request of Problem Manager, participate as Availability Expert in Major Incident Reviews and Root Cause Analyses

    • Perform System Outage Analyses

    • Set appropriate availability-related requirements

    Business Owner
    • Negotiate availability targets and sign-off on targets in Operational and/or Service Level Agreements

    • Review availability performance against target and participate in preparation and recommendation of any remedial actions

    • Participate in Availability Improvement initiatives

    • Provide information on availability - particularly faults and outages

    IT Operations Manager
    • Provide ongoing maintenance of CIs which affects their availability

    • Assist in describing the service chains associated with service provisioning

    • Participate in discussions of availability and recoverability

    • Meet Operational objectives and explain variances from targets.

    Higher Authority
    • Approve resources to meet availability requirements

    • Review and recommend availability improvement recommendations

    • Review balanced scorecard information describing availability

    • Act as the final point of escalation for availability issues

Availability Management Process Flow Diagram
  1. Availability Management Process Flow Diagram

    Figure 2.145.2-1

Availability Management Process

  1. Process steps:

Step 1: Plan and Design New and Changed Services
  1. Plan and Design New and Changed Services

Purpose
  1. The availability management process ensures that new or changed services are designed appropriately to meet the customer’s availability related requirements, defined in service level targets. The design must be developed not only to ensure that the new or changed service will meet its availability specifications, but also to ensure that performance of existing services is not negatively impacted. The work involves producing recommendations, plans and documents on design guidelines and criteria for new and changed services. The availability requirements of the business must be clearly defined and understood so that appropriate availability and recovery design criteria can be developed.

Roles and Responsibilities
  1. Availability Manager

  2. Service Owner

  3. Availability Management Team Member

  4. Business Owner

  5. IT Operations Manager

Entry Criteria
  1. A Request for Change for availability requirements has been received.

  2. A scheduled Availability Plan review is required.

  3. A condition has been reported requiring review of the Availability Plan.

  4. A related Service Level Agreement (SLA) review has been scheduled.

Input
  1. Business Information

  2. Business Impact Information

  3. Reports and Registers

  4. Service Information

  5. Financial Information

  6. Change and Release Information

  7. Service Asset and Configuration Management

  8. Service Targets

  9. Component Information

  10. Technology Information

  11. Past Performance Information

  12. Unavailability and Failure Information

  13. Planning Information

Process Activity
  1. Determine the Vital Business Functions, in conjunction with the business and ITSM.

  2. Determine the availability requirements from the business for a new or enhanced IT service and formulate the availability and recovery design criteria for the supporting IT components.

  3. Define the targets for availability, reliability and maintainability for the IT infrastructure components that underpin the IT service so that these can be documented and agreed to by Service Level Management.

  4. Perform risk assessment and management activities to ensure the prevention of, and/or recovery from, service and component unavailability.

  5. Design the IT services to meet the availability and recovery design criteria and associated agreed service levels.

  6. Establish measures and reporting of availability, reliability and maintainability that reflect the business, user and IT support organization perspectives.

  7. Determine the impact arising from IT service and component failure in conjunction with Information Technology Service Continuity Management (ITSCM) and, where appropriate, review the availability design criteria to provide additional resilience to prevent or minimize impact to the business.

  8. Implement cost-justifiable countermeasures, including risk reduction and recovery mechanisms.

  9. Review all new and changed services and test all availability and resilience mechanisms.

  10. Produce and maintain an availability plan that prioritizes and plans IT availability improvements.

  11. Review with stakeholders.

  12. Attend all related SLA reviews.

Output
  1. New or changed Availability Plan

  2. Availability Design recommendations

Exit Criteria
  1. Availability plan has been stored in the AMIS.

  2. Availability Design recommendations have been stored in the AMIS.

Step 2: Risk Assessment and Management
  1. Risk Assessment and Management

Purpose
  1. The purpose of the Risk Assessment and Management step is to ensure cost-justified reduction of risk and recovery from service and component unavailability. Risk assessment and management determines the impact arising from IT service and component failure in conjunction with ITSCM and, where appropriate, reviews the availability design recommendations to provide additional resilience to prevent or minimize impact to the business.

Roles and Responsibilities
  1. Availability Manager

  2. Service Owner

  3. Availability Management Team Member

  4. Business Owner

  5. IT Operations Manager

Entry Criteria
  1. Availability plan has become available.

  2. New Availability Design recommendations have become available.

  3. Recognition or notification of a change of risk or impact of a business process, IT service, or component.

Input
  1. Previous risk assessment reports

  2. Availability plan

  3. Availability Design recommendations

  4. Notifications of change of risk or impact of a business process, IT service, or component

  5. Availability, Reliability & Serviceability/Maintainability Requirements

  6. Configuration and Monitoring Data

Process Activity
  1. Determine the impact arising from IT service and component failure in conjunction with ITSCM.

  2. Review the availability design recommendations to provide additional resilience and prevent or minimize impact to the business.

  3. Complete the Availability Design.
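
    As a rough illustration of how additional resilience changes the expected availability of a design, the sketch below combines component availabilities using the standard series and parallel (redundant) formulas. The function names and the sample figures are assumptions made for illustration and are not part of this process description. The formulas also assume that components fail independently of one another.

```python
from math import prod


def series_availability(availabilities):
    """Availability of a service chain in which every component must be up."""
    return prod(availabilities)


def parallel_availability(availabilities):
    """Availability of redundant components where any single one suffices."""
    return 1.0 - prod(1.0 - a for a in availabilities)


# Hypothetical service chain: web tier, application tier, database tier.
single = series_availability([0.99, 0.99, 0.99])

# The same chain with the database tier duplicated for resilience.
redundant_db = parallel_availability([0.99, 0.99])
resilient = series_availability([0.99, 0.99, redundant_db])

print(f"without redundancy: {single:.4%}")
print(f"with redundant database: {resilient:.4%}")
```

    Comparing the two figures gives a first estimate of how much a proposed redundancy is worth before the Availability Design is completed. Real designs may need to account for shared failure modes that this simple model ignores.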

Output
  1. Revised risk assessment reviews and reports

  2. New or changed Availability Design

Exit Criteria
  1. Revised risk assessment reviews and reports have been stored in the AMIS.

  2. New or changed Availability Design has been stored in the AMIS.

Step 3: Implement Cost-Justifiable Countermeasures
  1. Implement Cost-Justifiable Countermeasures

Purpose
  1. The purpose of the Implement Cost-Justifiable Countermeasures step is to produce risk reduction measures and develop effective recovery mechanisms that address the risks identified by the risk assessment review. Countermeasures are implemented as part of the overall design of the new or changed service, as well as through the implementation of maintenance and continual review and improvement.

Roles and Responsibilities
  1. Availability Manager

  2. Service Owner

  3. Availability Management Team Member

  4. Business Owner

  5. IT Operations Manager

Entry Criteria
  1. Revised risk assessment reviews and reports have become available.

  2. New or changed Availability Design has become available.

Input
  1. New or changed risk assessment reviews and reports

  2. New or changed Availability Design

  3. Planned and Preventive Maintenance Plan

  4. Unavailability Analyses

  5. Service Failure Analyses

  6. Component information

  7. Service information

  8. Technology information

Process Activity
  1. Review all relevant inputs.

  2. Identify targets for countermeasures.

  3. Develop cost-justifiable countermeasures, including risk reduction and recovery mechanisms.

  4. Implement cost-justifiable countermeasures, including risk reduction and recovery mechanisms.

  5. Create or change the Planned and Preventive Maintenance Plan as necessary to include new or changed countermeasures.
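
    This document does not define how cost justification is calculated. One possible approach, sketched below, compares the annual cost of a countermeasure with the outage cost it is expected to avoid. The function names, the cost-per-outage-hour model, and all figures are hypothetical and would need to be replaced by locally agreed values.

```python
def avoided_annual_loss(hours_before, hours_after, cost_per_outage_hour):
    """Outage cost avoided per year if the countermeasure is implemented."""
    return (hours_before - hours_after) * cost_per_outage_hour


def net_annual_benefit(hours_before, hours_after, cost_per_outage_hour, annual_cost):
    """Avoided loss minus the yearly cost of running the countermeasure."""
    avoided = avoided_annual_loss(hours_before, hours_after, cost_per_outage_hour)
    return avoided - annual_cost


def is_cost_justifiable(hours_before, hours_after, cost_per_outage_hour, annual_cost):
    """A countermeasure is treated as justifiable when its net benefit is positive."""
    return net_annual_benefit(
        hours_before, hours_after, cost_per_outage_hour, annual_cost
    ) > 0


# Hypothetical candidates: (name, outage hours/year before, after, annual cost).
COST_PER_OUTAGE_HOUR = 5_000
candidates = [
    ("Clustered database", 40, 10, 100_000),
    ("Second data center", 40, 2, 400_000),
]

for name, before, after, cost in candidates:
    net = net_annual_benefit(before, after, COST_PER_OUTAGE_HOUR, cost)
    verdict = "justifiable" if net > 0 else "not justifiable"
    print(f"{name}: net annual benefit {net} ({verdict})")
```

    A simple comparison like this only ranks candidates on direct outage cost. Factors such as regulatory impact or reputational harm would have to be weighed separately.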

Output
  1. Implemented countermeasures, including risk reduction and recovery mechanisms

  2. New or changed Planned and Preventive Maintenance Plan

Exit Criteria
  1. Countermeasures have been implemented.

  2. New or changed Planned and Preventive Maintenance Plan has been stored in the AMIS.

Step 4: Review All New and Changed Services and Test all Availability and Resilience Mechanisms
  1. Review All New and Changed Services and Test all Availability and Resilience Mechanisms

Purpose
  1. The purpose of this step is to complete a review of all new and changed services and test all availability and resilience mechanisms. During the service transition stage, all the elements designed to contribute to service and component availability need to be reviewed and tested. Availability review and testing procedures and policies should be embedded into overall transition methods, processes and practices to ensure that the promised levels of availability will be delivered.

Roles and Responsibilities
  1. Availability Manager

  2. Service Owner

  3. Availability Management Team Member

  4. Business Owner

  5. IT Operations Manager

Entry Criteria
  1. Notification of a new or changed service.

  2. New or changed Planned and Preventive Maintenance Plan has become available.

  3. Countermeasures have been implemented.

Input
  1. Service information

  2. Planned and Preventive Maintenance Plan

Process Activity
  1. Review all new and changed services.

  2. Develop a test plan to test all availability and resilience mechanisms.

  3. Develop a testing schedule for testing availability, resilience, and recovery mechanisms.

Output
  1. New or changed Availability Test Plan

Exit Criteria
  1. The new or changed Availability Test Plan has been stored in the AMIS and made available to the service transition team.

Step 5: Continual Review and Improvement
  1. Continual Review and Improvement

Purpose
  1. The purpose of the Continual Review and Improvement step is to continually seek opportunities to optimize the availability of the IT infrastructure. A benefit of regular review is that enhanced levels of availability can sometimes be achieved at much lower cost. Optimization is therefore a sensible first step toward delivering better value for money. A number of availability management techniques can be applied to identify optimization opportunities. The scope should not be restricted to the technology; it should also include a review of the business process and other end-to-end business-owned responsibilities. To help achieve these aims, availability management needs to be recognized as a leading influence over the IT service provider organization to ensure continued focus on availability and stability of the technology.

Roles and Responsibilities
  1. Availability Manager

  2. Service Owner

  3. Availability Management Team Member

  4. Business Owner

  5. IT Operations Manager

  6. Higher Authority

Entry Criteria
  1. Scheduled Availability Testing has been performed.

  2. New Availability Test results are available.

Input
  1. New Availability Test results

  2. Availability Plan

  3. Configuration Item information

  4. Information on available technology

  5. Availability Management Reports

  6. Availability Requirements

  7. Availability Design Criteria

  8. Availability Design

  9. Availability Test Plan

  10. Planned and Preventive Maintenance Plan

Process Activity
  1. Review all relevant inputs.

  2. Identify opportunities to improve availability.

  3. Update the Availability Plan to prioritize and plan improvements.

Output
  1. New or updated Availability Plan

Exit Criteria
  1. The new or changed Availability Plan has been stored in the AMIS.

Step 6: Monitor and Measure Component Availability
  1. Monitor and Measure Component Availability

Purpose
  1. The purpose of the Monitor and Measure Component Availability step is to detect unavailability events and SLA/OLA violations and to deliver appropriate notifications upon detection.

Roles and Responsibilities
  1. Availability Manager

  2. Service Owner

  3. Availability Management Team Member

  4. Business Owner

  5. IT Operations Manager

Entry Criteria
  1. Availability monitoring infrastructure is operational.

  2. Monitoring of actual availability against agreed availability targets is in place.

Input
  1. Availability monitoring data

  2. Current availability metrics

  3. Past performance from previous measurements, service achievements and reports

  4. Unavailability and failure information from incident and problem reports

Process Activity
  1. Detect unavailability events.

  2. Store collected data in AMIS.

  3. Establish measures of availability and agree on availability targets with the business.

  4. Identify unacceptable levels of availability that impact the business and users.

  5. Monitor the actual availability delivered versus agreed targets.
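
    The comparison of delivered availability against an agreed target can be sketched as follows. The calculation uses the common definition of availability as agreed service time minus downtime, divided by agreed service time, expressed as a percentage. The `Outage` record, the function names, the reporting period, and the 99.5% target are assumptions made for illustration and do not describe the AMIS or any specific monitoring tool.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Outage:
    """One unavailability event (hypothetical record layout)."""
    start: datetime
    end: datetime


def availability_percent(period_start, period_end, outages):
    """Return (agreed service time - downtime) / agreed service time * 100.

    Assumes the service is expected to be up for the whole period and that
    the recorded outages do not overlap one another.
    """
    agreed_seconds = (period_end - period_start).total_seconds()
    if agreed_seconds <= 0:
        raise ValueError("period_end must be later than period_start")
    # Clip each outage to the reporting period before summing its duration.
    downtime_seconds = sum(
        (min(o.end, period_end) - max(o.start, period_start)).total_seconds()
        for o in outages
        if o.end > period_start and o.start < period_end
    )
    return 100.0 * (agreed_seconds - downtime_seconds) / agreed_seconds


def breaches_target(actual_percent, target_percent):
    """True when delivered availability falls below the agreed target."""
    return actual_percent < target_percent


# Hypothetical 30-day reporting period with two recorded outages.
period_start = datetime(2014, 9, 1)
period_end = datetime(2014, 10, 1)
outages = [
    Outage(datetime(2014, 9, 10, 2, 0), datetime(2014, 9, 10, 5, 0)),
    Outage(datetime(2014, 9, 20, 10, 0), datetime(2014, 9, 20, 14, 12)),
]
actual = availability_percent(period_start, period_end, outages)
if breaches_target(actual, target_percent=99.5):
    print(f"Availability {actual:.2f}% is below the agreed target of 99.50%")
```

    In practice the agreed service time may exclude planned maintenance windows, in which case those windows would be subtracted from the period before the percentage is calculated.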

Output
  1. Notifications of unavailability events

  2. Monitoring data reports

  3. Monitoring, management and reporting requirements for IT services and components to ensure that deviations in availability, reliability and maintainability are detected, acted on, recorded and reported

  4. Service availability, component availability, reliability, frequency of unavailability and maintainability reports

Exit Criteria
  1. Monitoring and measuring are ongoing activities.

  2. Measures of availability, and availability targets agreed on with the business, have been established and stored in the AMIS.

  3. Monitoring, management and reporting requirements for IT services and components have been stored in the AMIS.

  4. Service availability, component availability, reliability, frequency of unavailability, and maintainability reports have been stored in the AMIS.

Step 7: Investigate Service and Component Unavailability and Instigate Remedial Action
  1. Investigate Service and Component Unavailability and Instigate Remedial Action

Purpose
  1. The purpose of the Investigate Service and Component Unavailability and Instigate Remedial Action step is to investigate unavailability and, based on agreed terms, take appropriate corrective action when services or components become unavailable.

Roles and Responsibilities
  1. Availability Manager

  2. Service Owner

  3. Availability Management Team Member

  4. Business Owner

  5. IT Operations Manager

  6. Higher Authority

Entry Criteria
  1. Notification that a service or component has become unavailable.

Input
  1. Unavailability and failure information

Process Activity
  1. Investigate unavailability events.

  2. Implement remedial actions.

  3. Recommend changes to the Availability Plan if appropriate.

  4. Document actions taken in the AMIS.

Output
  1. Unavailability Analysis and Remedial Actions report

  2. Recommendations for changes to the Availability Plan if appropriate

Exit Criteria
  1. Remedial actions have been implemented.

  2. The Unavailability Analysis and Remedial Actions report has been stored in the AMIS.

  3. Recommendations for changes to the Availability Plan have been stored in the AMIS.

Tailoring Guidelines

  1. This process may not be tailored to meet specific project requirements. If tailoring is permitted, refer to the tailored approach according to what has been documented in the Availability Management process owner’s Tailoring Plan. All tailoring requests, with supporting rationale, should be submitted in writing to and approved by Solution Engineering.

Process Measurement

  1. Management will regularly review quantifiable data related to different aspects of the Availability Management process in order to make informed decisions and take appropriate corrective action, if necessary.

Training

  1. The following specific training is needed to perform the process:

    • Availability Management Process

CMMI, ITIL and PMI Compliance

  1. The Capability Maturity Model Integration (CMMI) can be used to judge the maturity of an organization’s processes and related process assets and can be used to plan further improvements. CMMI sets the standard for the essential elements of effective and mature processes, improved with quality and efficiency.
    The Information Technology Infrastructure Library (ITIL) contains a collection of best practices, enabling organizations to build an efficient framework for delivering IT Service Management (ITSM) and ensuring that they are meeting business goals and delivering benefits that facilitate business change, transformation, and growth.
    The Project Management Institute (PMI) organization advances the project management profession through globally recognized standards and certifications.
    This process asset is used to indicate that all artifacts are developed or acquired, incorporating CMMI, ITIL, and PMI requirements, to meet the business objectives of the organization and that they represent investments by the organization that are expected to provide current and future business value to the IRS.

Definitions, References

  1. Definitions, References

Definitions
  1. A Glossary is available on the IT Process Asset Library (PAL).

References
  1. The following resources are either referenced in this document or were used to create it:

    • ITIL V3

    • IRM (Enterprise Life Cycle Guide)

    • IRM Engineering Chapter, Availability Management Section

    • Availability Management Policy

    • IPM Process Description template

Exhibits

  1. Exhibits A & B

Exhibit A: Glossary
  1. Glossary

    Term – Definition
    Availability – The ability of a service, component or CI to perform its agreed function when required. It is often measured and reported as a percentage.
    Availability Management – Availability management addresses the ability of an IT component or service to perform at an agreed level over a period of time and support the business at a justifiable cost. The high-level activities realize availability requirements, compile an availability plan, monitor availability, and monitor maintenance obligations.
    Availability Management Information System – A version-controlled public repository containing all Availability Management artifacts.
    Availability Plan – The Availability Plan contains detailed information about initiatives aimed at improving service and/or component availability.
    Countermeasures – Actions taken to remedy problems or incidents.
    Maintainability – A measure of how quickly and effectively a service, component or CI can be restored to normal working after a failure. It is measured and reported as the mean time to restore service (MTRS).
    Proactive Availability Management – Involves the work necessary to ensure that new or changed services can and will deliver the agreed levels of availability and that appropriate measurements are in place to support this work. Proactive activities include planning, design and improvement of availability. These activities are principally performed as part of the design and planning roles.
    Reactive Availability Management – Involves the work necessary to monitor, measure, analyze and report service and component availability. Reactive activities include the monitoring, measuring, analysis and management of all events, incidents and problems involving unavailability. These activities are principally performed as part of the operational roles.
    Reliability – A measure of how long a service, component or CI can perform its agreed function without interruption. The reliability of the service can be improved by increasing the reliability of individual components or by increasing the resilience of the service to individual component failure (i.e. increasing the component redundancy, for example by using load-balancing techniques). It is often measured and reported as the mean time between service incidents (MTBSI) or mean time between failures (MTBF).
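
    The maintainability and reliability measures named in this glossary can be derived from three figures for a reporting period: the available (agreed service) time, the total downtime, and the number of service breaks. The sketch below uses the formulas commonly given in ITIL guidance. The function names and sample figures are illustrative assumptions only.

```python
def _check_breaks(breaks):
    """The means are undefined for a period with no service breaks."""
    if breaks <= 0:
        raise ValueError("at least one service break is required")


def mtrs(total_downtime_hours, breaks):
    """Mean time to restore service: average downtime per break."""
    _check_breaks(breaks)
    return total_downtime_hours / breaks


def mtbf(available_hours, total_downtime_hours, breaks):
    """Mean time between failures: average uptime per break."""
    _check_breaks(breaks)
    return (available_hours - total_downtime_hours) / breaks


def mtbsi(available_hours, breaks):
    """Mean time between service incidents: average interval between breaks."""
    _check_breaks(breaks)
    return available_hours / breaks


# Hypothetical 720-hour period with 3 breaks totalling 6 hours of downtime.
print(mtrs(6, 3), mtbf(720, 6, 3), mtbsi(720, 3))
```

    Under these definitions MTBSI equals MTBF plus MTRS, which gives a quick consistency check on reported figures.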
Exhibit B: Abbreviations and Acronyms
  1. Abbreviations and Acronyms

    Acronym Definition
    AMIS Availability Management Information System
    AM Availability Management
    CI Configuration Item
    CMMI Capability Maturity Model Integration
    CTO Chief Technology Officer
    ELC Enterprise Life Cycle
    ES Enterprise Services
    IPM Integrated Process Management
    IT Information Technology
    ITIL Information Technology Infrastructure Library
    ITSM Information Technology Service Management
    IRM Internal Revenue Manual
    ITSCM Information Technology Service Continuity Management
    OLA Operational Level Agreement
    PAL Process Asset Library
    PD Process Description
    PMI Project Management Institute
    RFC Request for Change
    SE Solution Engineering
    SLA Service Level Agreement
    SLM Service Level Management