

Summary

  1. Number of study hours: 15
  2. Short description of the course: the topics discussed in this learning object (LO) describe the main principles of IT Service Management, providing an overview of the main planning, development, provisioning and maintenance processes of IT services. Some of the processes described in the LO refer directly to the ITIL® “best practices”, which represent a “de facto standard” adopted by many companies to reduce costs and align IT systems with business goals.
  3. Target groups: The employers of IT core-level professionals are the target sector. The project objectives directly address the promotion of high knowledge and skills standards in the IT area and, in particular, provide an innovative approach to training. The first target group consists of IT students in the technology area (vocational school IT basic-level training and the first courses of colleges and universities) and IT practitioners who do not yet have vocational certificates.
  4. Prerequisites: Students must have an overall knowledge of Information Technology principles and terminology.
  5. Aim of the course - learning outcomes: The aim of this module is to provide students with guidelines and best practices for understanding and managing the main planning, development, provisioning and maintenance methodologies of quality IT services, in order to satisfy both the customer's needs and the financial objectives of the IT organization.

Contents

C.7.1 Customer relationships and service level agreements

C.7.1.1 Service Level Management Process and its benefits

The Service Level Management (SLM) process is one of the key activities within IT Service Management (in the Service Delivery area), since it is through its various steps that all agreements between the IT organization and the customer are established and formalized. Service Level Management is the process of negotiating, defining, measuring, managing and improving the quality of IT services. Its main aim is to ensure that the quality and cost of the services provided are really those agreed with the customer. To achieve this, the supplier and the customer must share the same understanding of the service that is provided and received, respectively. Another important SLM goal is to simplify communication between the customer and the managers of the services provided.

The following main documents are formalised during the SLM process:

  • Services Catalogue: This document describes the entire set of services that the IT organization is able to provide, together with the service levels that can be guaranteed for each.
  • Service Level Requirements (SLR): This document describes the customer's requirements in detail. The SLR is the result of collaboration between the IT organization and the customer.
  • Service Level Agreements (SLA): The SLA defines the agreement between customer and supplier on the services that must be provided and the service levels that must be guaranteed.
  • Operational Level Agreements and Underpinning Contracts: These documents formalize the agreements established within the IT organization (OLA), or between the IT organization and any third parties involved in providing the services (UC).
  • Service Quality Plan: This document describes in detail what the IT organization must do to ensure the agreed service levels.

The SLM process also handles the following tasks:

  • Monitoring and reporting
  • Service Improvement Plan (SIP)
  • Customer Relationship Management (CRM)

The Service Improvement Plan (SIP) defines the activities aimed at improving the quality of services. CRM includes all activities relating to communication with the customer.

C.7.1.2 Main elements of a service level agreement (SLA)

One of the main activities of the Service Level Management process is defining the SLA document. This document must state precisely the customer's expectations regarding the subject of the negotiation and the supply conditions. An SLA is a contract between the service supplier and the customer that defines the following main aspects:

  • the services required by the customer that the IT provider must supply
  • the evaluation metrics for each service (used to distinguish acceptable from unacceptable service levels)
  • the responsibilities and duties of both parties
  • the actions that must be undertaken under certain conditions

A typical SLA usually specifies:

  • Which services are required
  • The duration of the agreement, during which the service must be provided
  • The conditions under which the agreement is valid (environmental, organizational, technical, etc.)
  • The expected performance levels
  • The reports that the supplier must produce on the trend of the supplied service, specifying both the measures to be supplied and the measurement methods and reporting frequency
  • The responsibilities of both parties
  • The price that the customer pays the supplier for the service provision
  • The penalties applicable if the service is delivered in a way that does not comply with the agreement
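To show how these elements fit together, here is a minimal sketch of an SLA as a data structure in which every item is measurable and carries its own penalty. The class names, fields and the rule "one penalty per breached item per reporting period" are hypothetical illustrations, not a standard SLA schema.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceTarget:
    """One measurable SLA item (hypothetical structure, not a standard schema)."""
    service: str
    metric: str
    target: float
    higher_is_better: bool = True
    penalty: float = 0.0  # assumed: amount owed for each breached reporting period

    def is_met(self, measured: float) -> bool:
        # Compare in the direction that matters for this metric
        if self.higher_is_better:
            return measured >= self.target
        return measured <= self.target


@dataclass
class SLA:
    customer: str
    provider: str
    duration_months: int  # duration of the agreement
    price: float          # price the customer pays for the service provision
    targets: list[ServiceTarget] = field(default_factory=list)

    def penalties_due(self, measurements: dict[tuple[str, str], float]) -> float:
        """Sum the penalties of every target missed in this reporting period.
        Targets with no measurement are skipped rather than treated as breached."""
        total = 0.0
        for t in self.targets:
            measured = measurements.get((t.service, t.metric))
            if measured is not None and not t.is_met(measured):
                total += t.penalty
        return total


sla = SLA("Customer Ltd", "IT department", 12, 50_000.0, [
    ServiceTarget("e-mail", "availability_pct", 99.5, True, 1_000.0),
    ServiceTarget("e-mail", "response_time_h", 4.0, False, 500.0),
])
print(sla.penalties_due({("e-mail", "availability_pct"): 99.1,
                         ("e-mail", "response_time_h"): 3.0}))  # 1000.0
```

Storing the comparison direction with each target keeps "higher is better" metrics (availability) and "lower is better" metrics (response time) in one uniform list.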

There are some key rules that must be followed when drafting a valid SLA document:

  • First of all, time must be spent defining, and reaching agreement on, who is responsible for what.
  • Every SLA item must be measurable.
  • Every SLA item must be extremely specific.
  • Everyone involved in the SLA must be represented in the negotiation and creation of the agreements.
  • The agreement creation process is iterative. The parties create a rough draft of the SLA. This draft is then submitted to their respective groups for changes, additions and clarifications. The process continues until both groups are satisfied with the result.

C.7.1.3 Compare the uses and purposes of SLA, underpinning contracts and operational level agreements

When the IT organization and the customer agree on a cost-effective availability level, the agreement must be formalized in a document called an Operational Level Agreement (OLA). This document is a vital element in defining the SLA. Unlike the SLA, which is a formal, legally binding document agreed between the IT organization and its customers, an OLA is an operational document that provides details on the services described in the SLA. An OLA defines the requirements (in terms of time and resources) for the teams that develop and maintain the IT services. Contracts with external suppliers are mandatory, but many organizations have also recognized the benefits of having simple agreements with internal support groups, usually referred to as OLAs.

Some of the main characteristics usually defined in an OLA are:

  • Definition of the business processes supported and their relevance to the organization
  • Parties (signatories) of the OLA
  • Number of service users
  • Business impact of inactivity or unavailability
  • Critical service periods (e.g. usage peaks, month end)
  • Periods in which inactivity is tolerable
  • Inactivity periods scheduled for maintenance and updates
  • Maximum inactivity time before emergency plans are activated
  • Costs of inactivity or unavailability
  • How availability is evaluated (based on inactivity times and the minimum performance required)
  • Minimum number of access points to provide
  • Availability reporting frequency and format

In some contexts, not all the services defined in the SLA are provided directly by the IT organization that signed the agreement with the customer.

Most IT service providers depend to some extent on their own suppliers (internal and/or external). They cannot commit to meeting SLA targets unless their suppliers' performance underpins these targets. In these cases, agreements on the required services are contracted between the IT organization and its suppliers, and are formalized through Underpinning Contracts (UCs).

The structure and contents of these documents are similar to those of Operational Level Agreements, and they are completely transparent to the final customer. Unlike OLAs, UCs are mandatory when external suppliers are involved.
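As a quick check of whether SLA targets are actually underpinned, the sketch below assumes the service depends in series on every component covered by an OLA or UC, and that failures are independent. Under that standard serial model the best achievable end-to-end availability is the product of the individual availabilities. The function names and figures are hypothetical.

```python
from math import prod


def end_to_end_availability(underpinning_targets: dict[str, float]) -> float:
    """Availability (%) of a chain of independent components that must all be up."""
    # Convert each percentage to a fraction, multiply, convert back to a percentage
    return prod(a / 100 for a in underpinning_targets.values()) * 100


def sla_is_underpinned(sla_target_pct: float, underpinning_targets: dict[str, float]) -> bool:
    """True if the OLA/UC targets, taken together, can support the SLA target."""
    return end_to_end_availability(underpinning_targets) >= sla_target_pct


supports = {"OLA network team": 99.9, "OLA server team": 99.8, "UC hosting supplier": 99.9}
print(round(end_to_end_availability(supports), 2))  # 99.6
print(sla_is_underpinned(99.5, supports))  # True
print(sla_is_underpinned(99.7, supports))  # False
```

The example shows why each underpinning target must be stricter than the SLA target it supports. Three agreements at 99.8-99.9% combine to only about 99.6%.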

Monitoring and reporting

As soon as the SLA and its supporting internal documents (OLAs and UCs) are agreed, monitoring must begin and service achievement reports must be produced. Operational reports must be produced frequently (daily, perhaps even more often), and, where possible, exception reports should be produced whenever an SLA has been broken. Periodic reports must be produced and circulated to customers (or their representatives) and to the appropriate IT managers a few days before SLA reviews, so that any queries or disagreements can be resolved ahead of the review meeting and the meeting is not diverted by such issues. The SLA must be a flexible document that can be reviewed and updated as new agreements are reached between customer and provider.
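The exception-reporting step can be sketched as a filter over daily measurements. This minimal version assumes, for simplicity, that each service has one daily availability target and that missing it on a given day counts as a breach. The record type and report format are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DailyMeasurement:
    day: date
    service: str
    availability_pct: float


def exception_report(measurements: list[DailyMeasurement], targets: dict[str, float]) -> list[str]:
    """One line per day on which a service missed its agreed availability target."""
    lines = []
    for m in measurements:
        target = targets.get(m.service)  # services without a target are ignored
        if target is not None and m.availability_pct < target:
            lines.append(f"{m.day.isoformat()} {m.service}: "
                         f"{m.availability_pct:.2f}% < target {target:.2f}%")
    return lines


targets = {"e-mail": 99.5, "web portal": 99.0}
data = [
    DailyMeasurement(date(2024, 3, 1), "e-mail", 99.9),
    DailyMeasurement(date(2024, 3, 1), "web portal", 98.2),
    DailyMeasurement(date(2024, 3, 2), "e-mail", 99.0),
]
for line in exception_report(data, targets):
    print(line)
```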

In order to monitor and evaluate the service levels achieved, these levels must be observable and measurable. For this purpose, it is vital to assign one or more specific evaluation metrics to every service defined in the SLA. These metrics must also be defined in the SLA.

Moreover, the SLA must define the acceptable service levels and the financial penalties for failing to achieve the required service levels.

The following are some metrics that can be used to evaluate services:

  • Percentage of time the service is available
  • Number of simultaneous users the service must support
  • Specific service
  • Provider's response time to failures

The monitoring and reporting activities of Service Level Management have the following aims:

  • Negotiate and agree the required service levels
  • Measure and report the service levels achieved, the resources required and the cost of providing the services
  • Continually improve the service levels provided
  • Coordinate the other service management processes
  • Periodically review Service Level Agreements, in order to keep abreast of business and customer requirements
  • Evaluate service levels according to predefined metrics. It is advisable to define every metric according to value criteria agreed with the customer.

C.7.2 Capacity and Contingency Planning

C.7.2.1 Describe capacity management (as defined in ITIL) and explain the importance of the three sub-processes of “business”, “service” and “resource”

To respond satisfactorily to business requirements, adequate capacity (in terms of resources) must always be available. The main task of the Capacity Management process is to ensure the steady availability of the needed resources, in order to maintain a correct balance between business demand and IT capacity. Capacity Management is responsible for:

  • Monitoring the performance and throughput of IT services and of the supporting infrastructure components.
  • Managing the existing resources as efficiently as possible.
  • Quantifying the essential resources in relation to business requests.
  • Producing forecasts of future business requirements in order to keep pace with the business.
  • Ensuring that the supplied resources can guarantee the quality of the services defined in the Service Level Agreement.

All Capacity Management (CM) activities must always aim at maintaining a balance between:

  • Cost and capacity: to achieve this balance, the available resources must be used efficiently.
  • Supply and demand: the supplied capacity must be suitable for current and future business demands.
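A simple way to see both balances at once is the utilization ratio demand/capacity. Very high utilization means supply may not keep up with demand, and very low utilization means capacity is being paid for but not used. The sketch below classifies resources this way. The 30% and 80% thresholds are hypothetical values chosen for illustration, not ITIL figures.

```python
def capacity_balance(demand: float, capacity: float,
                     low: float = 0.30, high: float = 0.80) -> str:
    """Classify a resource by utilization = demand / capacity (thresholds are illustrative)."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    utilization = demand / capacity
    if utilization > high:
        return "supply at risk: add capacity"      # supply vs demand out of balance
    if utilization < low:
        return "over-provisioned: review costs"    # cost vs capacity out of balance
    return "balanced"


for name, demand, capacity in [("db-server CPU %", 90, 100),
                               ("file storage TB", 12, 50),
                               ("web tier req/s", 600, 1000)]:
    print(name, "->", capacity_balance(demand, capacity))
```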

The ITIL model divides the process into three CM sub-processes:

  • Business Capacity Management: responsible for ensuring that future IT service requirements are planned properly. It examines the current resource usage of the different IT services to assess prevailing trends and develop appropriate forecasts of future requirements.
  • Service Capacity Management: its aim is to verify the performance of the services provided and of the resources on which those services are based. To make sure that the service levels defined in the SLAs are always guaranteed, the services are constantly monitored and analyzed, and the causes of any problems must be identified. Some problems may in fact be caused by inadequate resource availability.
  • Resource Capacity Management: this sub-process aims to manage IT infrastructure resources, analyze workloads and determine the resources needed, ensuring that all resources are acquired and implemented as quickly and cheaply as possible.
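The trend analysis done in Business Capacity Management can be illustrated with an ordinary least-squares straight line fitted to monthly usage, then extrapolated to estimate when installed capacity will be exhausted. The linear model is an assumption chosen for simplicity. ITIL does not prescribe a forecasting technique, and real workloads are often seasonal or non-linear.

```python
import math


def linear_trend(values: list[float]) -> tuple[float, float]:
    """Least-squares fit values[t] ~ a + b*t for t = 0..n-1; returns (a, b)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_t = (n - 1) / 2
    mean_v = sum(values) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
    var = sum((t - mean_t) ** 2 for t in range(n))
    b = cov / var
    return mean_v - b * mean_t, b


def months_until_exhausted(values: list[float], capacity: float):
    """Months after the last observation until the trend reaches capacity.
    Returns None if usage is flat or decreasing, 0 if capacity is already reached."""
    a, b = linear_trend(values)
    if b <= 0:
        return None
    t_full = (capacity - a) / b  # solve a + b*t = capacity
    return max(0, math.ceil(t_full) - (len(values) - 1))


usage_gb = [400, 420, 440, 460, 480, 500]  # hypothetical monthly storage usage
print(linear_trend(usage_gb))                 # (400.0, 20.0)
print(months_until_exhausted(usage_gb, 600))  # 5
```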


C.7.2.2 Identify the purpose and describe the main elements of a capacity plan

Many of the activities described above are the same for all three sub-processes, but each sub-process is responsible for completely different aspects.

Business Capacity Management focuses on current and future business requirements. Service Capacity Management focuses on the capabilities needed to support the existing services. Resource Capacity Management focuses on the technology that forms the supporting infrastructure for all services.

The activities carried out by the Capacity Management process result in the creation, maintenance and periodic review of the Capacity Management Database and of the Capacity Plan document.

Information about the performance of the supplied services, collected during monitoring, is stored in a single repository called the Capacity Management Database (CDB). The CDB can be considered a centralized repository used by the three Capacity Management sub-processes to store the collected data. This information is used as a benchmark for planning capacity requirements and forecasting them for the monitored services. The CDB also stores information about the relationships and dependencies among services, among resources, and between services and resources.

In order to estimate the performance of the various services correctly and to build a complete and efficient CDB, the monitored items must be measurable and quantifiable. It is therefore fundamental to establish precise metrics for evaluating the quality of the supplied services and of the associated resources. Service quality must be measured from several points of view: business evolution, technology, current business requirements and financial aspects. For each of the specified metrics, the Capacity Management process should define the desired objectives and ensure that they are achieved.
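As an illustration of the two kinds of data a CDB holds (performance samples and service/resource dependencies), the toy in-memory class below answers two typical questions: which services rely on a resource, and which supporting resource of a service is most heavily loaded. The class and its methods are a hypothetical design, not a description of any CDB product.

```python
from collections import defaultdict


class CapacityDatabase:
    """Toy, in-memory stand-in for a CDB (hypothetical design)."""

    def __init__(self) -> None:
        self.samples = defaultdict(list)    # resource -> utilization samples (0..1)
        self.depends_on = defaultdict(set)  # service -> resources it relies on

    def record(self, resource: str, utilization: float) -> None:
        self.samples[resource].append(utilization)

    def link(self, service: str, resource: str) -> None:
        self.depends_on[service].add(resource)

    def peak(self, resource: str) -> float:
        values = self.samples.get(resource)
        return max(values) if values else 0.0

    def services_using(self, resource: str) -> list[str]:
        return sorted(s for s, rs in self.depends_on.items() if resource in rs)

    def service_bottleneck(self, service: str):
        """Supporting resource with the highest peak utilization, or None."""
        resources = self.depends_on.get(service)
        if not resources:
            return None
        return max(sorted(resources), key=self.peak)


cdb = CapacityDatabase()
cdb.link("payroll", "db01")
cdb.link("payroll", "app01")
cdb.link("intranet", "app01")
for u in (0.55, 0.72, 0.91):
    cdb.record("db01", u)
for u in (0.40, 0.48):
    cdb.record("app01", u)
print(cdb.services_using("app01"))        # ['intranet', 'payroll']
print(cdb.service_bottleneck("payroll"))  # db01
```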

Capacity Plan

The forecasts and the quantifications of the resources needed to support the services and the business demands are formalized into a specific document, the Capacity Plan.

Each of the three sub-processes of the Capacity Management process contributes to defining and updating the contents of this document. Like the SLA, the Capacity Plan is a flexible document, which can be modified and improved after it is first drawn up. It is advisable to audit and update this document periodically.

The information collected in the CDB is indispensable for producing an adequate Capacity Plan aligned with actual business requirements.

The main goal of the Capacity Plan is to define precisely the entire IT infrastructure that must be built and maintained in order to guarantee the agreed service levels.

A typical Capacity Plan document should include the following contents:

  • Plan scope: specifies the main IT infrastructure elements involved in the planning.
  • Methods used: defines how and when the information contained in the plan was obtained.
  • Business scenario: describes the current context of the customer's enterprise and possible future changes to it.
  • Services report: provides a summary profile of each supplied service.
  • Future service levels: describes the forecast increases in service levels.
  • Resources report: provides information about the usage level of the resources associated with the different services.
  • Assessment of future resource needs: quantifies and describes the resources needed to guarantee the forecast service levels.
  • Service improvement options: defines the actions that can be undertaken to improve current resource usage.


C.7.2.3 Explain the concepts of risk, threat and vulnerability and give examples of each in an IT context

Before describing the risk analysis and management process, it is important to define the following terms:

  • Risk: the possibility that the security of a system is violated because of an undesired event, with consequent damage to the system itself.
  • Threat: an undesired event that can cause a malfunction of, or damage to, the system. Threats include both accidental and deliberate events. Every threat must be associated with a “level” that indicates the expected frequency of its occurrence (based on statistics or experience). This level is then used in the risk evaluation.
  • Vulnerability: an intrinsic characteristic of the system that can lead, even accidentally, to damage and/or losses for the company. Every threat corresponds to a weak point in the system, whose characteristics are summarized in the vulnerability definition.
  • Asset: any item (hardware, software or data) in the IT infrastructure that must be protected.

The second phase of the IT Service Continuity Management (ITSCM) process (analysis of requirements and definition of strategies) is fundamental for defining the continuity plans. This phase includes a series of main activities:

  • Threats and vulnerability levels are estimated. During requirements analysis, particular attention must be paid to these two critical aspects. (Risk Analysis)
  • Risk levels are evaluated and measured. (Risk Analysis)
  • Actions to reduce the risks to acceptable levels are defined. (Risk Management)

Risk analysis involves identifying and evaluating the risk levels related to the system resources. These risk levels are estimated on the basis of:

  • values attributed to the resources themselves;
  • levels attributed to the threats related to the resources;
  • vulnerability related to the resources.

Risk management involves identifying, selecting and adopting countermeasures justified by the risks detected for the resources, in order to reduce these risks to an acceptable level.

C.7.2.4 Give examples of risk reduction measures

Risk management requires a decision on how to handle each risk. The choice may be one of the following:

Avoid the risk

This means not undertaking an activity, or removing a function. For example, if a particular asset is especially vulnerable compared with all the remaining assets, the risk can be avoided by excluding that asset from the system.

Reduce the risk

Reducing the risk to an acceptable level involves adopting suitable countermeasures. Countermeasures can act in different ways:

  • Transferring the risk to a third party (e.g. an insurance company)
  • Reducing the likelihood that a threat occurs
  • Reducing the vulnerability to a threat
  • Reducing or limiting the impact of a threat
  • Detecting an incident or the occurrence of a threat

Accept the risk

The customer can decide to accept the risk, usually because the countermeasure is considered too expensive.

The threat evaluation method depends on the methodology adopted. For example, in the CRAMM methodology, the threat level can take the following values:

  • Very Low: one incident every 10 years is expected
  • Low: one incident every 3 years is expected
  • Medium: one incident per year is expected
  • High: one incident every 4 months is expected
  • Very High: one incident per month is expected

The vulnerability evaluation method also depends on the methodology adopted. In CRAMM, the vulnerability level can take the following values:

  • Low: in case of incident there is less than 33% probability that the worst scenario will happen
  • Medium: in case of incident there is a probability between 33% and 66% that the worst scenario will happen
  • High: in case of incident there is more than 66% probability that the worst scenario will happen
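To make the two CRAMM scales concrete, the sketch below converts each threat level into expected incidents per year, taken directly from the definitions above (one every 10 years = 0.1, one every 4 months = 3, and so on). It then multiplies by a vulnerability probability to estimate how often the worst scenario is expected. Using the midpoint of each vulnerability band as a point estimate is an assumption made here for illustration, not part of CRAMM's own scoring.

```python
THREAT_INCIDENTS_PER_YEAR = {
    "very low": 1 / 10,   # one incident every 10 years
    "low": 1 / 3,         # one incident every 3 years
    "medium": 1.0,        # one incident per year
    "high": 3.0,          # one incident every 4 months
    "very high": 12.0,    # one incident per month
}

# Assumption: midpoint of each band as the probability of the worst scenario
VULNERABILITY_WORST_CASE_PROB = {
    "low": 0.165,     # band: below 33%
    "medium": 0.495,  # band: 33% to 66%
    "high": 0.83,     # band: above 66%
}


def expected_worst_case_per_year(threat: str, vulnerability: str) -> float:
    """Expected number of worst-case incidents per year for one threat/asset pair."""
    return THREAT_INCIDENTS_PER_YEAR[threat] * VULNERABILITY_WORST_CASE_PROB[vulnerability]


print(round(expected_worst_case_per_year("high", "medium"), 3))    # 1.485
print(round(expected_worst_case_per_year("very low", "high"), 3))  # 0.083
```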

The threat assessment phase is closely related to risk analysis. In this phase, the levels of the threats under consideration are calculated, i.e. of the threats that can cause a process interruption.

Some of these threats may be the following:

  • Natural disasters (fire, earthquake, storms…).
  • Theft of equipment, storage media or documentation
  • Interruption of the electrical power supply
  • Air conditioning system failure
  • Hardware failure
  • Human error during hardware/software maintenance
  • Sabotage or vandalism

Risk level evaluation

The effects of a threat on a system depend largely on the value of the assets involved. Analysing the consequences of each kind of threat is another aspect of risk analysis. In fact, the risk level relates the risk itself to the value of the asset involved.

Risks are measured by determining how serious the potential damage to the system would be if the threat occurred. This kind of evaluation uses the probability that the threat causes the worst damage when it occurs.

As a first approximation, risk can be defined as the product of the severity of the threat's effect and the probability that the threat occurs:

R = D \times P, where R = Risk, D = Damage, P = Probability

In information security, this generic definition can be refined for deliberate threats by treating the probability as a function of the system's vulnerability and of the attacker's motivation (called the threat level):

P = f(V, T), where V = Vulnerability, T = Threat

For accidental threats, the probability that an incident occurs is a function of the vulnerability related to the threat and of the intrinsic probability of the incident (e.g. the intrinsic probability of a natural disaster such as a flood, blackout or fire in the area concerned):

P = f(V, p), where V = Vulnerability, p = intrinsic probability

The severity of a threat's effects is usually expressed in terms of the economic damage sustained by the company involved, or in terms of an equivalent damage class. In the first case the analysis is quantitative, whereas in the second it is qualitative. Finally, the risk level can be viewed as a function that considers, in addition to the risk, the relationship between the risk and the value of the asset involved.

Risk level L = f (V,T,A), where V = Vulnerability, T = Threat, A = Asset value
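The formulas above leave the functions f unspecified. The sketch below makes one concrete choice for each, and both choices are assumptions for illustration. For P = f(V, T) or f(V, p) it multiplies the chance that the threat materialises by the chance that it exploits the weakness, treating the two as independent. For the qualitative L = f(V, T, A) it uses a hypothetical 1-3 score per factor with made-up band limits. R = D × P is applied exactly as defined above.

```python
def probability(vulnerability: float, likelihood: float) -> float:
    """One possible P = f(V, T) or f(V, p): chance the threat (or accidental event)
    occurs times chance it exploits the vulnerability. Assumes independence."""
    for x in (vulnerability, likelihood):
        if not 0.0 <= x <= 1.0:
            raise ValueError("inputs must be probabilities between 0 and 1")
    return vulnerability * likelihood


def risk(damage: float, p: float) -> float:
    """R = D x P: expected loss, in the same unit as the damage estimate."""
    return damage * p


def risk_level(v: int, t: int, a: int) -> str:
    """Hypothetical qualitative L = f(V, T, A), each factor scored 1 (low) to 3 (high)."""
    score = v * t * a  # ranges from 1 to 27
    if score >= 12:
        return "high"
    if score >= 4:
        return "medium"
    return "low"


# Accidental threat: server-room fire, intrinsic yearly probability 2%, vulnerability 0.5
p_fire = probability(0.5, 0.02)
print(round(risk(200_000, p_fire), 2))  # 2000.0 (expected loss per year)
print(risk_level(3, 2, 2))              # high
```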

Risk analysis conclusions

Once a risk level has been estimated, the countermeasures against the threats must be defined. Every countermeasure must be specific to the asset and its threats.

Every countermeasure must be described concisely, to simplify its identification and any subsequent updates.

The defined countermeasures must cover all recognized risks.

Once the countermeasures have been associated with the various assets, the residual risk can be estimated. Risk can never be removed entirely, even when appropriate countermeasures are adopted. Once the residual risk has been estimated, it is possible to decide whether the risk has been adequately reduced or whether the adopted countermeasures must be reviewed.
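The residual-risk check can be sketched as follows. It assumes, for illustration only, that each countermeasure removes an independent fraction e (with 0 ≤ e < 1) of the risk that remains. Because no countermeasure is perfect, the residual risk never reaches zero, which matches the point made above. The effectiveness values and the acceptable threshold are hypothetical.

```python
from math import prod


def residual_risk(initial_risk: float, effectiveness: list[float]) -> float:
    """Risk left after countermeasures, each removing an independent fraction e of it."""
    for e in effectiveness:
        if not 0.0 <= e < 1.0:
            raise ValueError("effectiveness must be in [0, 1)")
    return initial_risk * prod(1.0 - e for e in effectiveness)


def needs_review(initial_risk: float, effectiveness: list[float], acceptable: float) -> bool:
    """True if the countermeasures do not bring the risk down to the acceptable level."""
    return residual_risk(initial_risk, effectiveness) > acceptable


# 2000 * (1 - 0.5) * (1 - 0.6) = 400, below an acceptable level of 500
print(round(residual_risk(2000.0, [0.5, 0.6]), 2))
print(needs_review(2000.0, [0.5, 0.6], acceptable=500.0))
```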

The general concept of risk analysis and management can be represented by a simple diagram in which risk analysis and risk management are two distinct but related activities.

Comparison between Risk Analysis and Risk Management

The diagram shows the basic elements of risk analysis and management (asset, threat, vulnerability, risk and countermeasure) and the relationship between the two activities.

Disaster Recovery Plan

The Disaster Recovery Plan is a document that is often attached to the Service Continuity Plan. It defines the processes, policies and procedures for restoring, within a limited time, the operations critical to resuming business after a disaster. The interruption time must suit the business requirements and should keep the losses caused by the interruption as small as possible.

This plan must schedule the technological and organizational solutions that allow the company to continue its critical activities during the emergency and until normal operation is restored. Choosing and defining the processes and services to be protected by Disaster Recovery solutions is a critical phase in drawing up the plan.

C.7.2.5 Identify the purpose and describe the main elements of a contingency/service continuity plan

One of the main goals of a good IT Service Management process is to avoid or minimize the damage that accidental events can cause to the business.

In IT Management, the term “disaster” means any event that can interrupt the supplied IT services, damaging the business more or less seriously.

The definition of countermeasures to apply in case of disaster (contingency/service continuity planning) is the core of the IT Service Continuity Management process.

A Service Continuity Plan defines the actions that must be undertaken to face a disaster, according to its seriousness.

The Contingency Plan, by contrast, is an additional plan that describes the countermeasures to apply if the Service Continuity Plan fails.

The main target of the process is to minimize the business interruption caused by the disaster.

The ITSCM process can be divided into four main phases:

1. Initial phase

During this phase, draft contingency and continuity plans are defined for the individual business components. These drafts can be a good input for the entire process.

2. Analysis of requirements and definition of strategies

In this phase, the IT organization and the customer reach agreements that mainly specify the risks for which countermeasures must be defined (and, consequently, money must be spent). For example, for events with a very low probability of occurring, the customer may not want to allocate any budget for planning alternatives.

3. Implementation

This phase consists of the following processes:

  • Risk reduction methods implementation
  • Service Continuity Plan and Contingency Plan formalization
  • Recovery methods and procedures development
  • Initial testing of defined plans

4. Continuity plan maintenance and periodic review

The continuity plan must always be aligned with the evolution of the business.

C.7.3 Availability management

C.7.3.1 Identify the purpose and benefits of availability and define the concepts of availability, reliability, failure and recovery

Requests for 24/7 services are becoming increasingly common in IT business environments. A temporary loss of availability can have a very bad impact on customer satisfaction and, consequently, on the company's reputation. Moreover, because of the strong link between the business and the IT components, the Availability Management process must be developed with close regard to business requirements and characteristics. Within the Service Delivery area, the Availability Management process is responsible for managing service availability, with the aim of maintaining an availability level that satisfies business requirements.

The Availability Management (AM) process, as defined by ITIL®, uses the following terminology:

  • Availability: the ability of an IT component (or service) to perform the requested functions at a given moment or over a specific time interval.
  • Reliability: the ability of an IT component to deliver the requested functionality over a given time interval and under specific conditions (for instance, under “stress” conditions of the system).
  • Failure: a failure occurs when, due to an incident, the supplied services become inefficient and the services involved are unavailable.
  • Recoverability: the ability to recover the services involved in a failure quickly and correctly.

The Recoverability property includes the following characteristics:

  • Maintainability: a service is maintainable if it can be kept constantly in an operational state.
  • Resilience: the ability of a service to remain operational even if one or more IT components are damaged.
  • Serviceability: a contractual term that defines the support for service provisioning offered by a third party in case of unavailability.

The Reliability property can also be included in the Recoverability concept.

Other main concepts in the Availability Management process are:

  • Security: control policies must be implemented to guarantee that specific service security parameters are maintained.
  • Vital Business Functions (VBF): the critical elements of the whole business process supported by IT services.

C.7.3.2 Compare some of the commonly-used measures of availability (percentage availability, frequency of failure, mean time between failures, impact of failure)

The life-cycle of a generic incident can be split in the following phases:

  • Incident occurrence: the user notices the problem
  • Detection: the appropriate IT team is informed about the incident
  • Diagnosis: the incident causes are determined
  • Repairing: the countermeasures able to remove the problem are applied
  • Recovery: the involved services are completely restored

Based on this breakdown, the following intervals are defined:

  • TTR, Time To Repair: the time interval between the detection of the incident and the end of the recovery phase (it corresponds to the downtime).
  • TBF, Time Between Failures: the time interval between recovery from a failure and the detection of the next incident (it corresponds to the uptime).
  • TBSI, Time Between System Incidents: the time elapsed between two consecutive incidents.


Unavailability Life-Cycle

The drawing shows the phases of an incident life-cycle along the time axis. After an incident occurs, the first event is its detection, followed by the diagnosis of the incident and the application of the countermeasures (repair). Finally, the services involved are recovered.

In order to analyze and measure the availability levels, various metrics can be used:

  • MTTR (Mean Time To Repair): as this value decreases, system availability and maintainability increase.
  • MTBF (Mean Time Between Failures): the mean uptime must be maximized in order to increase availability levels.
  • MTBSI (Mean Time Between System Incidents): as with the MTBF, the higher this value, the higher the availability and reliability. MTBSI is equal to the sum of MTTR and MTBF.
  • Availability percentage: this metric expresses availability as a percentage, according to the following formula:

\%\text{Availability} = \frac{AST - DT}{AST} \times 100


where: AST = agreed service time (the period over which the service is analyzed), DT = downtime during the agreed service time
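The definitions of TTR, TBF and TBSI and the availability formula can be computed together from a simple incident log. This sketch assumes incidents are given as (detected, recovered) times in hours from the start of the AST, sorted and non-overlapping, and it measures downtime from detection to recovery, as in the TTR definition above. The function and field names are hypothetical.

```python
def availability_metrics(ast_hours: float, incidents: list[tuple[float, float]]) -> dict:
    """%Availability = (AST - DT) / AST * 100, plus MTTR, MTBF and MTBSI.
    incidents: sorted, non-overlapping (detected, recovered) times in hours."""
    if not incidents:
        return {"availability_pct": 100.0, "mttr": 0.0, "mtbf": None, "mtbsi": None}
    ttr = [rec - det for det, rec in incidents]                                    # downtimes
    tbf = [incidents[i + 1][0] - incidents[i][1] for i in range(len(incidents) - 1)]  # uptimes
    tbsi = [incidents[i + 1][0] - incidents[i][0] for i in range(len(incidents) - 1)]
    downtime = sum(ttr)
    return {
        "availability_pct": (ast_hours - downtime) / ast_hours * 100,
        "mttr": downtime / len(ttr),
        "mtbf": sum(tbf) / len(tbf) if tbf else None,
        "mtbsi": sum(tbsi) / len(tbsi) if tbsi else None,
    }


# 30 days of 24-hour service with three incidents
m = availability_metrics(720, [(100, 102), (300, 303), (500, 501)])
print(round(m["availability_pct"], 2), m["mttr"], m["mtbf"], m["mtbsi"])  # 99.17 2.0 197.5 200.0
```

For each pair of consecutive incidents, TBSI = TTR + TBF holds exactly. On a short log, however, the averages satisfy MTBSI = MTTR + MTBF only approximately, because the last repair has no following uptime interval.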

Other availability measures include:

  • Unavailability percentage: the complement of the availability percentage (100% minus availability); it is useful to define unacceptable availability levels.
  • Duration: obtained by converting the availability percentage into time.
  • Frequency of Failure (FOF): the mean frequency of failures (service interruptions).
  • Impact of Failure (IOF): an evaluation of the seriousness of the service damage caused by a failure, from a business perspective.
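The additional measures can be derived from the availability percentage and from failure counts. A minimal Python sketch, assuming a 720-hour (30-day) period, a hypothetical 99.5% target and three failures; the Impact of Failure is not computed because it is a business assessment rather than a formula.

```python
def unavailability_pct(availability_pct: float) -> float:
    # Complement of the availability percentage
    return 100.0 - availability_pct


def downtime_hours(availability_pct: float, period_hours: float) -> float:
    # "Duration": the availability percentage expressed as downtime in the period
    return unavailability_pct(availability_pct) / 100.0 * period_hours


def frequency_of_failure(failures: int, period_hours: float, per_hours: float = 168.0) -> float:
    # Mean number of failures per reference interval (default: one week = 168 h)
    return failures / period_hours * per_hours


PERIOD = 720.0   # assumed 30-day measurement period, in hours
TARGET = 99.5    # hypothetical availability target, in %

print(f"Unavailability: {unavailability_pct(TARGET):.1f} %")
print(f"Allowed downtime: {downtime_hours(TARGET, PERIOD):.1f} h per {PERIOD:.0f} h")
print(f"FOF: {frequency_of_failure(3, PERIOD):.2f} failures per week")
```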

C.7.3.3 Give examples of availability management methods and techniques, such as component failure impact analysis (CFIA) and fault tree analysis (FTA)

Several techniques and methodologies support the Availability Management process.

The most widely used are the following:

  • CFIA : Component Failure Impact Analysis
  • CRAMM : CCTA Risk Analysis and Management Method
  • FTA : Fault Tree Analysis

The main aspects of CFIA and FTA are described below.

Component Failure Impact Analysis

The CFIA methodology is based on the analysis and prediction of the impact that failures of IT components can have on the IT services provided.

The main CFIA activity consists of producing accurate and reliable information that can be used as input for the planning and recovery activities of the Availability Management process. This information mainly concerns:

  • The impact on business and users that an IT component failure can cause
  • The points of failure that affect availability
  • The recovery time of each component
  • The need to identify and define recovery options
  • The need to identify and define some risk reduction measures

The information produced by CFIA can also be useful to the IT Service Continuity Management process.

To produce this information, CFIA relies on a static analysis of the IT infrastructure architecture. This analysis produces a scheme representing the dependencies between IT services and IT infrastructure components, in order to identify:

  • Critical services, that is, services whose availability depends on the correct functioning of a considerable number of IT infrastructure components.
  • Single points of failure, that is, components whose failure can seriously impact several IT services.
  • The services for which efficient failure-recovery procedures have been defined.

The dependency scheme can be extended with additional information (e.g., the probability of component failure, the service recovery time) in order to obtain a more detailed and comprehensive description, which is a useful starting point for Availability planning, improvement and recovery activities.
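The dependency scheme is often drawn as a grid of services against components. The Python sketch below is a hypothetical example of such a grid: the service and component names are invented, the marking "X" (no alternative) / "A" (alternative available) is an assumed convention, and the thresholds standing for "considerable number" and "several" are arbitrary parameters.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical CFIA grid: service -> {component: mark}
#   "X": the service stops if the component fails (no alternative)
#   "A": an alternative (e.g. a redundant path) keeps the service running
Grid = Dict[str, Dict[str, str]]

grid: Grid = {
    "e-mail":   {"mail-server": "X", "core-switch": "X", "dns": "A"},
    "crm":      {"app-server": "X", "db-server": "X", "core-switch": "X"},
    "intranet": {"web-server": "X", "core-switch": "X", "dns": "A"},
}


def impacted_services(grid: Grid, component: str) -> List[str]:
    """Services that stop if `component` fails."""
    return sorted(s for s, deps in grid.items() if deps.get(component) == "X")


def single_points_of_failure(grid: Grid, min_services: int = 2) -> List[str]:
    """Components without alternative whose failure stops at least `min_services` services."""
    hits = Counter(c for deps in grid.values() for c, mark in deps.items() if mark == "X")
    return sorted(c for c, n in hits.items() if n >= min_services)


def critical_services(grid: Grid, min_components: int = 3) -> List[str]:
    """Services that depend, without alternatives, on at least `min_components` components."""
    return sorted(s for s, deps in grid.items()
                  if sum(mark == "X" for mark in deps.values()) >= min_components)


print("Single points of failure:", single_points_of_failure(grid))
print("Critical services:", critical_services(grid))
print("Hit by a core-switch failure:", impacted_services(grid, "core-switch"))
```

Adding failure probabilities or recovery times per cell would turn the same structure into the more detailed description mentioned above.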

Fault Tree Analysis

The fault tree analysis technique (FTA) provides a method for determining the chain of events that causes an IT service failure.

FTA connects, in a graphical and logical way, the faults of the various system components. The main goal is to correlate, in a functional manner, an IT service failure with the faults of infrastructure components.

The following figure represents a simple fault tree model, in which the tree root corresponds to an IT service failure, while the nodes depict the combined events that cause the fault:

Fault Tree sample

The figure represents a fault tree model related to a generic service fault. The nodes represent the events that, combined by the logical operators labelling the arcs, produce the root event (that is, a service failure). The leaves of the tree are "Power OFF", "software error", "Main link breakdown" and "Backup link breakdown". There is only one intermediate node: "Network down".

Fault Tree Analysis considers the following events:

  • Basic event: the failure of a single IT infrastructure component (e.g., a hardware failure or a human error). In the fault tree, basic events are represented as terminal nodes.
  • Resulting event: an intermediate node in the fault tree that corresponds to the result of a combination of events. The root of the tree is also a resulting event.
  • Conditional event: an event that can happen only under specific conditions (for example, a fault of the air-conditioning system can cause a service malfunction only if the room temperature exceeds certain bounds).
  • Trigger event: an event that causes one or more other events (e.g., a power failure can cause the automatic shutdown of PCs).

Events in a fault tree are combined using the following logical operators (logic gates), which label the arcs of the tree:

  • AND: the resulting event occurs if and only if all the input events are true.
  • OR: the resulting event occurs if at least one of the input events is true.
  • Exclusive OR: the resulting event occurs if exactly one of the input events is true.
  • Inhibition: the resulting event occurs if and only if the initial condition is not true.
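The gates can be evaluated recursively over the tree. The Python sketch below encodes the sample tree described above; because the gate labels of the figure are not reproduced in the text, it assumes that "Network down" is the AND of the two link breakdowns and that the service fails on the OR of "Power OFF", "software error" and "Network down". The INHIBIT gate is read here as: the output occurs when its inputs occur and the inhibiting condition is not true; the air-conditioning case is one possible way to model the conditional event.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Gate:
    kind: str             # "AND", "OR", "XOR" or "INHIBIT"
    inputs: List["Node"]  # basic events (strings) or other gates
    condition: str = ""   # inhibiting condition, used only by INHIBIT


Node = Union[str, Gate]   # a plain string is a basic event (a leaf)


def occurs(node: Node, events: Dict[str, bool]) -> bool:
    """Return True if the event represented by `node` occurs, given the basic events."""
    if isinstance(node, str):
        return events.get(node, False)
    values = [occurs(child, events) for child in node.inputs]
    if node.kind == "AND":
        return all(values)
    if node.kind == "OR":
        return any(values)
    if node.kind == "XOR":
        return sum(values) == 1
    if node.kind == "INHIBIT":
        return all(values) and not events.get(node.condition, False)
    raise ValueError(f"unknown gate: {node.kind}")


# Assumed gates for the sample tree
network_down = Gate("AND", ["main link breakdown", "backup link breakdown"])
service_failure = Gate("OR", ["power off", "software error", network_down])

# The air-conditioning fault only matters when the temperature is NOT within bounds
overheating = Gate("INHIBIT", ["air-conditioning fault"], condition="temperature within bounds")

print(occurs(service_failure, {"main link breakdown": True}))
print(occurs(service_failure, {"main link breakdown": True, "backup link breakdown": True}))
```

With the assumed gates, losing only the main link leaves the service running, while losing both links brings it down.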

Fault Tree Analysis is a support instrument that can be useful to the Availability Planning and Improvement activities within the Availability Management process.

C.7.4 Service Desk

C.7.4.1 Explain the purpose of a Service Desk in a service support organization

Unlike the other procedures in the Service Delivery and Service Support areas, the Service Desk is not defined as a process; it is, in fact, a function, and one of vital importance for the entire Service Management process.

The Service Desk can be defined as "the single point of contact among the customer, the IT organization that provides services (and third parties) and users". Through the functions offered by the Service Desk, users can communicate with the service provider, for example to report incidents, to request service changes, or to obtain help and assistance in using the services.

Goals of the Service Desk

Generally, the adoption of a Service Desk has the following aims:

  • To guarantee bidirectional communication between the system and its users (assistance and contextual training on the one hand, collection of suggestions, requests and complaints on the other), in order to optimize the entire IT infrastructure.
  • To allow desk operators to work with the greatest efficiency, quickly resolving any doubt an inexperienced user may raise. A related goal is to collect every suggestion about possible service improvements.
  • To provide operational support to other Service Management processes (e.g., Change Management and Availability Management) and to anomaly correction and incident recovery actions.

A proactive approach

Many organizations still manage the "support problem" in an entirely reactive manner, without any structured planning or collaboration among the teams in charge of this task. Traditional Help Desks and Call Centres use this approach, which is inexpensive in the short term but has several disadvantages in the long term:

  • the same problems being resolved repeatedly rather than eliminated
  • uncoordinated and unrecorded change takes place
  • staff resource/cost requirements being unclear
  • no management information available – decisions being based on ‘I think’ rather than ‘I know’
  • low customer confidence/perception

Well-framed, ad-hoc solutions are needed to provide a proactive support service that is an integral part of the Service Management process. Modern Service Desks are founded on this approach.

C.7.4.2 Identify the different types of service desk and describe the circumstances in which each is appropriate

Designing a Service Desk support infrastructure correctly is critical to success and should be done as a formal business improvement project with clear ownership, defined business goals and responsibilities. There are three main SD models:

  • Local Service Desk: a separate Service Desk for each physical location of the customer company. In this case, standardized procedures and knowledge sharing among the Service Desks are fundamental.
  • Central Service Desk: all service requests are logged remotely to a single central physical location. This solution reduces operational costs and optimizes resource management.
  • Virtual Service Desk: a single support centre that can be accessed by all the local Service Desks. This solution is commonly adopted by companies with many sites in different countries.

It is important to define evaluation metrics to estimate the effectiveness of the Service Desk, clarifying the main goals of the desk. In this way, the levels of the services supplied can be evaluated objectively.

To design a good Service Desk, it is recommended to adopt the following basic rules:

  • Clarify and define business requirements and needs
  • Verify the availability of the needed resources and financial budget before beginning the implementation
  • Define operative solutions (quick wins)
  • Define clear objectives and deliverables for the Service Desk
  • Involve/consult the customers and the end users in the definition of Service Desk functions
  • Define evaluation metrics
  • Educate/train Customers and Users in the use of the new service and its benefits

Finally, an incremental approach to Service Desk implementation is recommended, starting with a minimal set of functions that can be gradually extended.

Nowadays, several technologies can be adopted to implement a Service Desk:

  • Advanced telephone systems (e.g., Computer Telephony Integration (CTI), Voice over Internet Protocol (VoIP), etc.)
  • IVR (Interactive Voice Response) systems
  • Electronic mail (e.g., voice, video, mobile comms, Internet, email systems)
  • Fax servers
  • Search and diagnostics tools
  • Automated operations and Network Management tools

None of these technologies is the best in absolute terms; the choice depends on the context in which the Service Desk is implemented (line of business, customer preferences, etc.) and on the functions the desk offers (for instance, e-mail can be used for non-urgent requests).

Computerised Service Desk

Many of the typical activities of a Service Desk (gathering requests, logging and monitoring incidents and their causes, etc.) lend themselves to computerisation.

Unlike a "manual" Service Desk, a fully or partially computerised Service Desk provides additional benefits:

  • Every request can be tracked and stored
  • Duplicate, lost or forgotten requests are eliminated
  • Complex support tasks and calculations are made easier
  • A history of incidents and their countermeasures (known errors) is maintained
  • Information collected by the Service Desk is available to the staff
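As an illustration of the first, second and fourth benefits, the Python sketch below keeps a minimal request log. It is not modelled on any specific Service Desk product; the `RequestLog` class, its rule for spotting duplicates (same user, same summary, request still open) and the known-error store are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import count
from typing import Dict, Optional


@dataclass
class Request:
    ref: int
    user: str
    summary: str
    logged_at: datetime
    is_open: bool = True


def _key(text: str) -> str:
    # Normalise free text so that trivial differences do not create duplicates
    return text.strip().lower()


class RequestLog:
    """Hypothetical minimal request log for a computerised Service Desk."""

    def __init__(self) -> None:
        self._next_ref = count(1)
        self.requests: Dict[int, Request] = {}   # every request is tracked and stored
        self.known_errors: Dict[str, str] = {}   # symptom -> countermeasure history

    def log(self, user: str, summary: str) -> Request:
        # Return the existing open request instead of creating a duplicate
        for req in self.requests.values():
            if req.is_open and req.user == user and _key(req.summary) == _key(summary):
                return req
        req = Request(next(self._next_ref), user, summary, datetime.now())
        self.requests[req.ref] = req
        return req

    def close(self, ref: int, symptom: str, countermeasure: str) -> None:
        # Closing a request also records its countermeasure as a known error
        self.requests[ref].is_open = False
        self.known_errors[_key(symptom)] = countermeasure

    def lookup(self, symptom: str) -> Optional[str]:
        return self.known_errors.get(_key(symptom))


desk = RequestLog()
first = desk.log("alice", "Printer offline")
again = desk.log("alice", "printer offline ")  # same open request, no duplicate created
desk.close(first.ref, "printer offline", "Restart the print spooler")
print(first.ref == again.ref, desk.lookup("Printer offline"))
```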

A computerised Service Desk (“self-service”) offers to users the opportunity to obtain support services, in an autonomous manner, without direct intervention from a support professional.

It can be used as a method of reducing operating costs and improving Customer satisfaction by allowing them greater control over the transaction, especially out of normal support hours and for non-critical activities. Technologies such as the Internet, Interactive Voice Response systems, and mobile and wireless communications make self-service operations possible.

The main features of a "self-service" Service Desk are:

  • Customers have direct access to support information (increasing their knowledge)
  • Customers can autonomously manage some of the support activities (the non-critical ones)
  • The time needed to handle requests and solve non-critical problems can be minimized.

A successful self-service strategy depends on several important factors:

  • Control of computerised activities: assistance processes and tools should guide users through a predefined sequence of actions that cannot be changed.
  • Use of business metrics: every possible user request needs to be planned and scheduled, in order to implement suitable ad-hoc support functions.
  • Control of received requests: the computerised support activities used by end users must not modify valid configurations.
  • Ease of use: critical or complex support activities must be managed directly by the Service Desk team.
  • Communication: customers must know which computerised support services the Service Desk provides and the rules for using them.

Responsibilities and functions

The tasks and responsibilities of a Service Desk depend strongly on the business context and on the supported IT infrastructure.

For many organizations, the main role of the Service Desk is to manage and track the life-cycle of incidents that have an impact on IT services.

Incidents that cannot be solved quickly by the Service Desk are usually delegated to a specialized support team (called the "second-line team"), which is responsible for solving the incident and informing the Service Desk about the results.

The Service Desk is responsible for keeping users informed about any progress in problem-solving activities; it acts as the interface between users and the problem management team.

The common Service Desk functions include:

  • Receiving calls, first-line Customer liaison
  • Recording and tracking Incidents and complaints
  • Contributing to Problem identification.
  • Keeping Customers informed on request status and progress
  • Making an initial assessment of requests, attempting to resolve them or refer them to someone who can
  • Managing the request life-cycle, including closure and verification
  • Communicating planned and short-term changes of service levels to Customers
  • Coordinating second-line and third-party support groups
  • Providing management information and recommendations for service improvement
  • Highlighting Customer training and education needs
  • Closing Incidents and confirming closure with the Customer


C.7.4.3 Define the main elements of an incident management system (as referenced in ITIL)

The information about detected incidents collected by the Service Desk can be used to evaluate the quality and availability levels of the entire IT infrastructure.

The analysis of this information can be very useful to other Service Management processes such as Availability Management, Service Level Management and Business Continuity Management, and to all the processes in the Service Support area.

For example, if the same software problem occurs several times, it may be necessary to update or replace the tools involved; this activity will be carried out by the Change Management process.

The information collected is also useful to the Service Desk itself, in order to plan proactive solutions for supporting and managing problems.

The actions carried out by the Service Desk to solve problems and user-reported incidents make up the Incident Management process.

The main goal of this process is to restore the normal operation of the services involved as quickly as possible, minimising the negative impact that an incident can have on the business.

In the ITIL® terminology an incident is “any event that is not part of the standard operation of a service and that causes, or may cause, an interruption to, or a reduction in, the quality of that service.”

Examples of incidents are:

  • Software error (e.g.: caused by a bug)
  • Hardware malfunction
  • Human error (e.g., a forgotten password)

Requests to update or change hardware or software are not considered incidents: they are classified as Requests for Change (RFCs) and are managed by the Change Management process.

The status of an incident reflects its current position in its life-cycle while it is being handled by the Incident Management (IM) process.

Some examples of status categories might include:

  • New (just after being reported)
  • Accepted (if it is an incident, it must be managed by IM)
  • Waiting for assignment
  • Assigned/dispatched to specialist
  • Work in progress (WIP)
  • Resolved
  • Closed
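The status categories can be enforced as a simple state machine. The transition table in the Python sketch below is hypothetical (each organization defines its own rules); it lets an incident be resolved directly after acceptance, reflecting the fact that many incidents are resolved by the Service Desk itself, and allows a resolved incident to be reopened.

```python
from enum import Enum


class Status(Enum):
    NEW = "New"
    ACCEPTED = "Accepted"
    WAITING = "Waiting for assignment"
    ASSIGNED = "Assigned/dispatched to specialist"
    WIP = "Work in progress"
    RESOLVED = "Resolved"
    CLOSED = "Closed"


# Hypothetical transition table
TRANSITIONS = {
    Status.NEW: {Status.ACCEPTED},
    Status.ACCEPTED: {Status.WAITING, Status.RESOLVED},  # resolved directly by the Service Desk
    Status.WAITING: {Status.ASSIGNED},
    Status.ASSIGNED: {Status.WIP},
    Status.WIP: {Status.RESOLVED, Status.ASSIGNED},      # may be reassigned to another specialist
    Status.RESOLVED: {Status.CLOSED, Status.WIP},        # reopened if the user is not satisfied
    Status.CLOSED: set(),
}


def advance(current: Status, new: Status) -> Status:
    """Move to `new` only if the transition table allows it."""
    if new not in TRANSITIONS[current]:
        raise ValueError(f"{current.value} -> {new.value} is not allowed")
    return new


history = [Status.NEW]
for step in (Status.ACCEPTED, Status.WAITING, Status.ASSIGNED,
             Status.WIP, Status.RESOLVED, Status.CLOSED):
    history.append(advance(history[-1], step))
print(" -> ".join(s.value for s in history))
```

Recording each transition with a timestamp would also give the data needed for the progress reports mentioned below.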

It’s important that users are kept informed about the progress of incident management. The Service Desk is responsible for providing Customers with an up-to-date progress report on the handling of their incidents.

Process actions

After each user notification, the SD performs the following actions:

  • Information recording (date and time, the user who reported the notification, etc.)
  • If the notification is a change request, the SD manages it in accordance with the procedures defined by the organization
  • If it is an incident, the SD collects and records the information about its causes (if they are detectable)
  • A priority level is assigned to the incident. Incident priority depends on the business impact and on the urgency of restoring the involved services
  • The SD searches the stored information about previous incidents for a possible solution (many incidents are resolved directly by the SD)
  • If a well-known solution exists, it is applied and the incident is closed
  • If no solution is found, the incident is passed to a support team. When the team resolves the problem, the SD stores the information about the incident
  • The SD informs users that the incident has been closed
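The actions above can be sketched as a single handling function. In the Python example below, the impact/urgency priority matrix, the status strings and the exact-match lookup of known errors are simplifying assumptions, not ITIL® definitions; `inform` is a hypothetical stand-in for whatever channel the Service Desk uses to notify users.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Callable, Dict, List

# Hypothetical scale: 1 = high, 3 = low
LEVELS = {"high": 1, "medium": 2, "low": 3}


def priority_of(impact: str, urgency: str) -> int:
    """Hypothetical priority matrix: 1 (highest) to 5 (lowest)."""
    return LEVELS[impact] + LEVELS[urgency] - 1


@dataclass
class Notification:
    user: str
    description: str
    is_change_request: bool = False
    impact: str = "low"
    urgency: str = "low"
    logged_at: datetime = field(default_factory=datetime.now)  # recorded on arrival
    priority: int = 0
    status: str = "New"
    solution: str = ""


Inform = Callable[[str, str], None]


def handle(n: Notification, known_errors: Dict[str, str], inform: Inform) -> Notification:
    if n.is_change_request:
        # Requests for Change are not incidents: route them to Change Management
        n.status = "Forwarded to Change Management"
    else:
        n.priority = priority_of(n.impact, n.urgency)
        workaround = known_errors.get(n.description.lower())
        if workaround:
            # A known solution exists: apply it and close at the Service Desk
            n.solution, n.status = workaround, "Closed"
        else:
            # No stored solution: pass the incident to a second-line support team
            n.status = "Assigned to second-line team"
    inform(n.user, n.status)
    return n


def record_resolution(n: Notification, solution: str,
                      known_errors: Dict[str, str], inform: Inform) -> None:
    # The support team has solved it: store the countermeasure, close, notify the user
    known_errors[n.description.lower()] = solution
    n.solution, n.status = solution, "Closed"
    inform(n.user, n.status)


known = {"vpn login fails": "Reset the VPN token"}
messages: List[str] = []


def inform(user: str, status: str) -> None:
    messages.append(f"{user}: {status}")


a = handle(Notification("bob", "VPN login fails", impact="medium", urgency="high"), known, inform)
b = handle(Notification("eve", "Disk full on file server", impact="high", urgency="high"), known, inform)
c = handle(Notification("ann", "Install new CAD software", is_change_request=True), known, inform)
record_resolution(b, "Archive old logs", known, inform)
print(messages)
```

Storing the second-line solution in `known` means the next identical report is closed directly by the Service Desk.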

The major benefits to be gained by implementing an Incident Management process are as follows:

For the business as a whole:

  • Reduced business impact of Incidents by timely resolution, thereby increasing effectiveness
  • The proactive identification of beneficial system enhancements and amendments
  • The availability of business-focused management information related to the SLA.

For the IT organisation in particular:

  • Improved monitoring, allowing performance against SLAs to be accurately measured
  • Improved management information on aspects of service quality
  • Better staff utilisation, leading to greater efficiency
  • Elimination of lost or incorrect Incidents and service requests
  • More accurate CMDB information (giving an ongoing audit while registering Incidents)
  • Improved User and Customer satisfaction.

Some ITIL® guidelines for planning Incident Management are as follows:

  • Do not plan to implement and operate Incident Management in isolation. If possible, the scope of planning should be extended to include the implementation, integration and operation of the other Support processes. If resources are not available to implement all Service Support processes at the same time, begin by implementing the Service Desk function together with Incident Management.
  • Plan for the creation of a database containing information about recorded incidents. This database is usually managed by the Problem Management process.
  • Plan for an interface with the Problem Management system to assist Service Desk staff in recognising and giving advice on circumventing Known Errors.



Links to additional materials

  • Hamman W., Why ITIL®?, article available at http://www.itsmwatch.com/itil (as of 03/2005).
  • IT Service Management, An Introduction, Based on ITIL®, 2nd edition. Van Haren Publishing, 2004.
  • Schiesser, R., IT Systems Management: Designing, Implementing, and Managing World-Class Infrastructures. Prentice Hall PTR, 2002.
  • IT Service Management, IT Service Management Forum/CCTA, ITIMF Ltd., 1995.
  • McBride, D., Successful Deployment of IT Service Management in the Distributed Enterprise. Hewlett-Packard Company, white paper, 1998.
  • Anderson, K., Kerr, C., Customer Relationship Management (CRM). McGraw-Hill, 2002.
  • ITIL® Service Delivery, OGC©, 2001.
  • Bouman, J.J., Trienekens, J.J.M., and van der Zwan, M., Specification of service level agreements, clarifying concepts on the basis of practical research. Proceedings of the 9th International Workshop on Software Technology and Engineering Practice, IEEE Computer Society, Los Alamitos, 1999, pp. 103–111.
  • CRAMM Management Guide, CROWN©, 1996.
  • Availability Management, CCTA, HMSO, 1993.
  • ITIL® Service Support, OGC©, 2001.
  • Ferris K., How ITIL® brings benefit to the help desk, Support World, HDI Publication, 2003.
  • Twitchell, M. C., Moving from helpless desk to help desk: practical strategies for improving customer service in a multi-function university help desk. Proceedings of the 25th Annual ACM SIGUCCS Conference on User Services, 1997.

Test questions

(C.7.1 Customer Relationships and Service Level Agreements)

Question 1. In which order are the following documents defined in the SLM process?

  1. Service Level Agreement
  2. Service Catalogue
  3. Service Level Requirements
  4. Operational Level Agreement and Underpinning Contract


Question 2. The SLA contract…

  1. … is just a guideline for the IT organization
  2. … is a flexible document that can be modified by the customer in agreement with the provider
  3. … is a flexible document that can be modified directly by the customer
  4. … is a static document that cannot be modified


Question 3. Amongst the following, which are the goals of service monitoring and reporting activities?

  1. To negotiate and agree desired service levels
  2. To constantly improve service levels
  3. To define service provisioning cost
  4. To periodically review SLAs in order to be aligned with business needs


(C.7.2 Capacity and Contingency Planning)

Question 4. All the activities covered by the Capacity Management process must always aim at maintaining a balance between:

  1. Costs against capacity
  2. Supply against demand
  3. Business against demand
  4. IT organization availability against requests

Question 5. As a first approximation, how can risk be calculated? (R = Risk, D = Damage, P = Probability of damage)

  1. R = D + P
  2. R = \frac{D}{P}
  3. R = D \times P
  4. R = D - P


(C.7.3 Availability management)

Question 6. Which of the following are true?

  1. The term availability means the ability of an IT component to supply the requested functionalities over a given time interval and under specific conditions
  2. If, due to an incident, the supplied services become inefficient and the involved services are unavailable, there is a failure.
  3. The Recoverability characteristic represents the ability to quickly and correctly recover the services involved in a failure.
  4. The term Reliability means the ability to keep a service constantly operative

Question 7. Which of the following are availability metrics?

  1. Unavailability percentage
  2. Frequency of Failure
  3. Impact of Failure
  4. Number of incidents per year


(C.7.4 Service Desk)

Question 8. Which of the following processes are part of Service Delivery?

  1. Configuration Management
  2. Availability Management
  3. Change Management
  4. Capacity Management

Question 9. What are the main goals of a Service Desk?

  1. To guarantee a bidirectional communication between system and users
  2. To allow the desk operators to work with the greatest efficiency
  3. To provide an operative support to other processes in the Service Support area
  4. To check system consistency after configuration changes
