Server Downtime: Understanding true cost is key for creating effective plan to manage risk

Server downtime plagues the vast majority of organisations, yet most do not know what it actually costs them. Here is what they should do to establish the true cost of downtime and protect themselves cost-effectively.

Vikramjeet Bhatti

Apr 5, 2018 7:18 PM IST

Server downtime is an issue that plagues the vast majority of organisations. When servers go down, many—if not all—of an organisation's most critical applications become unavailable, and the cost of being unable to do business mounts minute by minute. But how much does server downtime cost companies? The reality is that most organisations do not know the answer to this question. Few measure their cost of application downtime, and even if they do, they measure it incorrectly. Without knowing this exact cost, an organisation's ability to make sound investments in data center technology and availability protection is impaired.

Ultimately, high availability is a business decision that is based on the value a computer system has to an enterprise. When thinking about availability solutions, it is important to consider that the applications that levy the highest downtime cost on your organisation are likely to be the ones you want up and running first after a server outage. What is the value of these applications, and what does it cost you when they are inaccessible? Without knowing the true cost of downtime, your organisation can't properly and cost-effectively protect itself.

“Availability”, what does it mean?


The term “availability” characterises how reliably a computing system can function and run the tasks it was designed to run. When talking about servers, there are different levels of availability.


Backups and restores: Basic backup, data-replication, and failover procedures are in place via conventional servers. Recoverability translates into 99% to 99.9% availability.

High availability: Applications are accessible a very high percentage of the time. Users perceive little or no interruption if there is a failure. High availability translates into 99.95% to 99.99% availability.

Continuous availability: Even if there is a failure, server operations are not interrupted. Downtime is eliminated, and data is not lost in the event of a server failure. Continuous availability translates into 99.999% availability.

At first glance, there doesn't seem to be a huge difference among all these percentages of 9s. However, when considering that Aberdeen Research recently found that downtime now costs an organisation an average of $138,000 per hour, the cost differences become readily apparent.
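To make the gap between the "nines" concrete, here is a minimal sketch that converts each availability percentage quoted above into expected hours of downtime per year and prices those hours at the $138,000 hourly average cited from Aberdeen Research. The function names are illustrative, and the calculation assumes an 8,760-hour year and a flat hourly cost, which real outages will not follow exactly.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years (assumption)
COST_PER_HOUR = 138_000    # USD per hour, the average cited in the article


def annual_downtime_hours(availability_pct: float) -> float:
    """Hours per year a system is unavailable at the given availability."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


def annual_downtime_cost(availability_pct: float,
                         cost_per_hour: float = COST_PER_HOUR) -> float:
    """Expected yearly downtime cost, assuming a flat hourly cost."""
    return annual_downtime_hours(availability_pct) * cost_per_hour


if __name__ == "__main__":
    # Availability levels mentioned in the article
    for pct in (99.0, 99.9, 99.95, 99.99, 99.999):
        hours = annual_downtime_hours(pct)
        cost = annual_downtime_cost(pct)
        print(f"{pct}% -> {hours:.2f} hours/year, about ${cost:,.0f}")
```

Under these assumptions, 99% availability allows roughly 87.6 hours of downtime a year, or about $12.1 million at the cited rate, while 99.999% allows about five minutes, or roughly $12,000.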


Cost of downtime involves much more than just lost wages

It may come as a surprise that the cost of downtime involves much more than lost wages. When calculated correctly, it affects the entire organisation, with no single group seeing the full impact. Sales are lost. Employee productivity goes down. Customers become frustrated. And competitors can benefit.

The costs of downtime include both direct and indirect costs. Direct costs are expenses that can be completely attributed to the production of specific goods or to a particular function or service, whereas indirect costs are more difficult to quantify but can be even more damaging to an organisation.

Costs of downtime can be broken down into the following categories:

Business costs: These are the first costs that come to mind for most people. Lost wages, overtime, and remedial labor costs all add up during an outage. Sales can be lost and so can future repeat business. For example, imagine the consequences if a retailer experiences downtime during the holiday season. Potential customers won't have the time or patience to put up with an outage and will take their business elsewhere.

Other business costs include lost inventory and the scrap of work in progress, potential legal penalties for not delivering on service-level agreements, and litigation costs due to third parties seeking compensation for losses incurred during a system outage.

Productivity costs: During an outage, employees can't perform their regular duties. The impact of this idle time varies by industry. For example, in an office environment, an employee may not be able to access the Internet but can work on a desktop spreadsheet program, so perhaps his or her productivity would be cut in half. But in a manufacturing environment, if the line stops, employees may be 100% unproductive.

Recovery costs: These costs include the price paid to repair the system, IT staff overtime, and third-party consultants or technicians needed to restore services. Another consideration is the opportunity cost incurred when IT needs to focus on system recovery instead of working on other critical projects for the organisation.

Customer loss: The effects of indirect costs can be felt long after an outage is resolved. Previously loyal customers can lose faith and take their business to competitors. Once a company is seen by its customers as unreliable, it can be very difficult to undo the perception.

Reputation damage: Bad publicity can cause major damage to an organization—and not just large ones. It's true that the traditional press loves a good headline about bad news at a big company. But what can a complaint on Twitter or a negative post on Facebook cost you? Convergys found that one bad tweet can cost a company 30 customers. And while industry websites and bloggers don't always have a large audience base, they do have the rapt attention of your target market. This means that one negative blog post about your company can make a huge impression on your customers and prospects.

Shareholder value impact: Bad press can also devalue a company's stock and reduce its market capitalization. Especially in shaky economic times, the stock market reacts to negative press about a company, even more so if the news is about a significant sales loss—an event that is entirely possible when servers go down.
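The more quantifiable categories above (business, productivity and recovery costs) can be combined into a rough per-outage estimate. The following is a sketch only: the class, its field names and every figure in the example are hypothetical illustrations rather than data from the article, and customer loss, reputation damage and shareholder impact are left out because they are much harder to put a number on.

```python
from dataclasses import dataclass


@dataclass
class OutageEstimate:
    """Inputs for a single outage. All fields are hypothetical examples."""
    hours_down: float
    revenue_per_hour: float    # sales that cannot be made while systems are down
    affected_employees: int
    hourly_wage: float
    productivity_loss: float   # 0.5 = output cut in half, 1.0 = fully idle
    recovery_cost: float       # repairs, IT overtime, outside technicians

    def business_cost(self) -> float:
        """Sales lost for the duration of the outage."""
        return self.hours_down * self.revenue_per_hour

    def productivity_cost(self) -> float:
        """Wages paid for work that could not be done."""
        return (self.hours_down * self.affected_employees
                * self.hourly_wage * self.productivity_loss)

    def quantifiable_cost(self) -> float:
        """Sum of the three categories that can be estimated directly."""
        return (self.business_cost() + self.productivity_cost()
                + self.recovery_cost)


# Office scenario: staff can still do some work, so productivity is halved.
office = OutageEstimate(hours_down=2, revenue_per_hour=10_000,
                        affected_employees=50, hourly_wage=30,
                        productivity_loss=0.5, recovery_cost=5_000)

# Manufacturing scenario: the line stops, so staff are fully idle.
factory = OutageEstimate(hours_down=2, revenue_per_hour=10_000,
                         affected_employees=50, hourly_wage=30,
                         productivity_loss=1.0, recovery_cost=5_000)
```

With these made-up inputs, `office.quantifiable_cost()` comes to 26,500 and `factory.quantifiable_cost()` to 28,000. The only difference is the productivity-loss factor, which mirrors the office versus manufacturing contrast described under productivity costs.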

Downtime effects vary by industry

Downtime affects different types of organisations in different ways, and it is important to take these differences into account when thinking about the cost of downtime. In manufacturing, unexpected downtime can mean lost inventory, a lower-quality product, or unsalable products. In some cases, a momentary disruption in production can cause an entire run to be scrapped due to regulatory guidelines, a potentially devastating scenario and a harsh reality for food and pharmaceutical manufacturers. When production deadlines are missed, the business can suffer both financially and in terms of reputation.


The retail sector is hit hard by IT downtime, losing $18.18 billion per year due to outages. A single downtime event can be a huge blow to a retailer's financials, especially when it happens during the holiday shopping season. In a store setting, point-of-sale (POS) systems need to be up and running to process sales and maintain the flow of customers through the store. Server downtime can mean poor customer service when employees can't check product availability, and long, slow checkout lines for customers trying to make their purchases.

Public safety organisations have their own unique set of concerns. Public safety 911 call takers, dispatchers, and first responders all depend on the applications and information managed by computer systems to protect lives and property. Downtime of public safety answering point (PSAP) applications causes slower emergency response times and can tragically result in the worst type of loss: loss of life.

For financial services organisations, downtime affects transactions. Customers want to complete their transactions quickly and securely, whether over the Internet, by telephone, at a local branch office, through an ATM, or via debit/credit card. When downtime occurs, financial institutions are hit hard at the company level: a CA Technologies report says that revenue loss due to IT downtime is $224,297 per company each year.

Options for protecting your critical applications

After the cost of downtime is thoroughly understood, it's important to consider the level of availability your most important applications need. For business-critical applications such as CRM, ERP, back-office databases that run the business, financial software, and email servers, service interruption and data loss are very expensive. Critical systems and services where downtime is not an option include:

Manufacturing execution systems (MESs)
Security systems
Trading and banking systems
Electronic medical record (EMR) systems
Applications that support emergency response operations
Applications that control life-sustaining processes
Military and civilian security applications

Options to consider for guarding against downtime include the following:

Standard Servers: Always-On Level of 99%

A standard x86-based server typically stores data on RAID (redundant array of independent disks) storage devices. However, a standard x86 server may have only basic backup, data-replication, and failover procedures in place, which leaves it susceptible to catastrophic server failures. A standard server is not designed to prevent downtime or data loss. In the event of a crash, the server stops all processing and users lose access to their applications and information, so data loss is likely. Standard servers also do not protect data in transit, which means that if the server goes down, this data is lost as well.

Traditional High-Availability Solutions: Always-On Level of 99.9% to 99.95%

Traditional high-availability solutions that can bring a system back up quickly are based on server clustering: two or more servers that run with the same configuration and rely on cluster software to keep the application data updated on all servers.

While high-availability clusters improve availability, their effectiveness is highly dependent on the skills of specialized IT personnel. It is also important to note that downtime is not eliminated with high-availability clusters. In the event of a server failure, all users who are currently connected to that server lose their connections. Therefore, data not yet written to the database is lost.

Advanced High-Availability Solutions: Availability of 99.99%

The most advanced high-availability solutions are software designed to prevent downtime, data loss, and business interruption, with a fraction of the complexity and at a fraction of the cost of high-availability clusters. These solutions are equipped with predictive features that automatically identify, report, and handle faults before they become problems and cause downtime. Two important features of advanced high-availability software are that it works with standard x86 servers and doesn't require the skills of highly advanced IT staff to install or maintain it.

Fault-Tolerant Solutions: Availability of 99.999%

Fault-tolerant solutions are also referred to as continuous availability solutions. A fault-tolerant server provides the highest availability because it has system component redundancy with no single point of failure. This means that end users never experience an interruption in server availability because downtime is pre-empted.
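One way to compare the four options is to ask which is the lowest tier whose expected annual downtime still fits within what an application can tolerate. The sketch below is illustrative only: the `TIERS` table takes the lower bound wherever a range is quoted above, the function name and the downtime budgets are hypothetical, and a real decision would also weigh solution cost and staff skills, which are not modelled here.

```python
from typing import Optional

HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years (assumption)

# (option, availability in percent), lowest tier first.
# Where a range is quoted, the lower bound is used (assumption).
TIERS = [
    ("Standard servers", 99.0),
    ("Traditional high-availability solutions", 99.9),
    ("Advanced high-availability solutions", 99.99),
    ("Fault-tolerant solutions", 99.999),
]


def minimum_tier(max_downtime_hours_per_year: float) -> Optional[str]:
    """Return the first tier whose expected downtime fits the budget.

    Returns None when even the highest tier listed exceeds the budget.
    """
    for option, availability_pct in TIERS:
        expected_hours = HOURS_PER_YEAR * (1 - availability_pct / 100)
        if expected_hours <= max_downtime_hours_per_year:
            return option
    return None


if __name__ == "__main__":
    # Hypothetical yearly downtime budgets, in hours
    for budget in (100, 10, 1, 0.1):
        print(f"{budget} h/year tolerated -> {minimum_tier(budget)}")
```

Under these assumptions, an application that can tolerate 10 hours of downtime a year is served by a traditional high-availability cluster, while one that can tolerate only six minutes (0.1 hours) needs a fault-tolerant solution.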

The need for availability has become critical to most organizations. “The server is down” is not an acceptable excuse for systems not working. Downtime affects the whole organization, and its costs are both direct (lost wages, lost sales, lost customers) and indirect (loss of employee productivity, reputation damage, opportunity costs).

Understanding the cost of downtime to your organization is the first step in creating a plan to manage the risk. The next step is to think about the level of availability that your most critical applications need and whether your goal is to recover as quickly as possible after a failure has occurred or to prevent failure altogether. The answers to these questions will help you determine the appropriate course of action for your organization.

The author is Managing Director – India & SAARC, Stratus Technologies.
