Emerald Media Network

What To Do About Holiday Outages

By Mahesh Ramachandran, OpsRamp

It’s that time of year again. Forget turkey, cranberries and pesky in-laws: it’s time to get your shopping on. For IT organizations at retailers and e-Commerce companies, it’s an exciting time and also one where every detail matters.

So far, sales predictions are robust, with eMarketer forecasting that this will be the first-ever trillion-dollar holiday season in the United States. The analyst firm predicts that Cyber Monday will once again outpace Black Friday in sales, and that e-Commerce will represent 13.4% of all holiday retail sales this year. Separately, Salesforce Commerce Cloud data indicates that U.S. digital revenue will grow 13% year over year (YoY) this holiday season.

New Variables And Risks

The stakes are, as ever, high in the cutthroat global e-Commerce market. This year is particularly unusual here in the States because there are six fewer days between Thanksgiving and Christmas than in 2018. That means more people crowding your web site and other shopping channels on an average day to get orders in before the mid-December shipping deadline.

Speaking of channels — another trend, according to Salesforce, is that we will see more shopping move to the edge, as younger shoppers flock to social media and messaging applications to make their purchases. Retailers selling on these channels will need to consider the potential impact of edge sales on IT stability.

We’ve all heard it before: one hour of downtime can result in catastrophic revenue losses during a critical sales period. Although the comparison may not be useful for the average business, Amazon’s one hour of downtime on Prime Day 2018 may have cost the Internet giant an estimated $100 million in lost sales.

How To Sharpen IT Ops Strategies This Holiday Season

When it comes to helping companies ensure a successful online holiday season, IT operations plays a central role in preventing outages and keeping web sites and apps running optimally for impatient and distracted consumers. The strategy I recommend revolves around three core tenets of modern IT Ops: deep visibility, capacity planning and proactive incident response.

1. Visibility: The ability to see real-time status and metrics on infrastructure across the business is critical, so that your organization can understand vulnerabilities and bottlenecks. Armed with the best data you can get on your environment, you can assess the business impact of IT hotspots and capacity constraints and see where the business might get into trouble during high demand. First understand what the steady state looks like in terms of interconnections, metrics and utilization. A map of this steady state might highlight a point of danger, such as too many connections going through a single node. If that node goes down, the whole web site could also be out of commission. Take time to analyze the scenarios most likely to occur during unpredictable surges in customer activity.

2. Capacity planning: Once you’ve done a mapping exercise, you can take preventive measures to lower risks during seasonal spikes, such as adding more routes or redundancy to the network. The point is to eliminate single points of failure. Balancing cloud versus on-premises capacity is another smart tactic from both a performance and a cost perspective. Many organizations will rely upon internal IT resources for static, predictable demand and scale up cloud resources for the unpredictable surges in traffic.

3. Proactive incident response: The ability to proactively identify failure points in the IT environment is one of the hardest things for companies to do, yet it’s the only way to preempt systemic, business-impacting outages. AI technologies are now helping IT operations manage and control alert chaos and make correlations faster to get an accurate root cause analysis. It’s also valuable to understand what caused previous significant incidents and outages: modern ITOM systems enable rapid historical analysis. Since you can’t prevent all issues, having a process in place to quickly mitigate and respond, including how to best communicate with customers, is vital. Simulating incident response to a major issue is always a capital idea.
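
To make the mapping exercise from the first two tenets concrete, here is a minimal Python sketch of one way to flag single points of failure in a dependency map. The topology, the node names (cdn, lb-1, orders-db and so on) and the helper functions are all hypothetical; a real map would come from your monitoring or discovery tooling. The approach is deliberately simple: remove each node in turn and check whether traffic entering the site can still reach every critical service.

```python
from collections import deque


def reachable(graph, start, removed=None):
    """Return the set of nodes reachable from `start`, ignoring `removed`."""
    if start == removed:
        return set()
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def single_points_of_failure(graph, entry, critical):
    """List nodes whose loss cuts `entry` off from any critical service."""
    candidates = set(graph) - {entry} - set(critical)
    spofs = []
    for node in sorted(candidates):
        if not set(critical) <= reachable(graph, entry, removed=node):
            spofs.append(node)
    return spofs


# Hypothetical steady-state map: each key depends on the nodes it lists.
topology = {
    "cdn": ["lb-1"],
    "lb-1": ["web-1", "web-2"],
    "web-1": ["cache", "orders-db"],
    "web-2": ["cache", "orders-db"],
    "cache": [],
    "orders-db": [],
}

# Same map after adding a second, redundant load balancer.
redundant = dict(topology)
redundant["cdn"] = ["lb-1", "lb-2"]
redundant["lb-2"] = ["web-1", "web-2"]

print(single_points_of_failure(topology, "cdn", ["orders-db"]))
print(single_points_of_failure(redundant, "cdn", ["orders-db"]))
```

In this made-up example the lone load balancer is the only node flagged, and adding a second one clears the list. The brute-force check is fine for a map of a few hundred nodes; a larger environment would likely call for a proper articulation-point algorithm.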

As a final note, make sure that backup sites are equally vetted and ready to go in case of an emergency. Too often, companies set and forget disaster recovery environments; ensure that you have enough capacity to handle a failover and that all of your DR systems have been updated and tested. With proper planning, a holiday outage will likely never happen to your business. But if it does, be ready so that the impact on customers is minimal.
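
The capacity questions raised above, both the cloud burst from the second tenet and the failover check, come down to simple arithmetic. The following Python sketch shows one way to frame it. Every number, the 25% headroom figure and the function names are illustrative assumptions, not recommendations; substitute your own forecasts and measured throughput.

```python
import math


def cloud_instances_needed(peak_rps, on_prem_rps, rps_per_instance, headroom=0.25):
    """Cloud instances required to cover forecast peak beyond on-premises capacity.

    `headroom` is an assumed safety margin applied on top of the forecast peak.
    """
    target = peak_rps * (1 + headroom)
    burst = max(0.0, target - on_prem_rps)
    return math.ceil(burst / rps_per_instance)


def dr_can_absorb_peak(peak_rps, dr_rps, headroom=0.25):
    """True if the disaster recovery site can carry peak load plus headroom."""
    return dr_rps >= peak_rps * (1 + headroom)


# Hypothetical figures: requests per second at forecast peak and per site.
forecast_peak = 12_000
on_prem_capacity = 8_000
per_instance = 500
dr_capacity = 10_000

print(cloud_instances_needed(forecast_peak, on_prem_capacity, per_instance))
print(dr_can_absorb_peak(forecast_peak, dr_capacity))
```

With these assumed figures, the target is 15,000 requests per second, so the cloud has to absorb 7,000 of them (14 instances), and a 10,000-request DR site would fall short in a full failover.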

Mahesh Ramachandran is vice president of product management at OpsRamp. He has 18 years’ experience spanning roles in product management and R&D in IT operations management, cloud computing, server virtualization, log/event management, operating systems, compilers and programming language runtimes.