Navigating the Risks of the "Make It Work" Button in Tech

Chapter 1: Understanding the "Make It Work" Button

Today, I'm going to share a cautionary tale that could unsettle many of you who rely on technology for your livelihood. Here’s the crux: the "make it work" button is often alarmingly close to the "break it all" button.

To illustrate this, let me recount a few incidents from my past.

In a previous role, I was part of a team that extensively utilized a Redis cache. Our service performed complex transformations on data, caching the results for efficiency. After years of smooth operation, we reached a point where the Redis engine could no longer handle the volume of requests and stopped responding.

After some troubleshooting, we decided to fail over from the primary node to the secondary and initiated a reboot because, well:

“Hello, IT, have you tried turning it off and on again?”

While this is generally a safe procedure, in this instance the primary node was under heavy load and had not been keeping the secondary node in sync. Consequently, failing over effectively emptied our cache, and when we tried to refill it, we overwhelmed the new primary.

We thought we were activating the "make it work" button, but in reality, we pressed the "break it all" button.
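In hindsight, a check like the following before the failover might have warned us. This is a minimal sketch, not what we actually ran: it assumes you have already captured the text returned by the Redis `INFO replication` command on the primary, and the function names and the one-megabyte threshold are my own illustrative choices.

```python
def parse_info(info_text: str) -> dict[str, str]:
    """Parse the `key:value` lines of Redis INFO output into a dict."""
    fields = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers such as "# Replication"
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def replica_lag_bytes(info_text: str) -> dict[str, int]:
    """Return how many bytes each online replica is behind the primary."""
    fields = parse_info(info_text)
    master_offset = int(fields["master_repl_offset"])
    lags = {}
    for key, value in fields.items():
        # Replica lines look like:
        # slave0:ip=10.0.0.2,port=6379,state=online,offset=1000,lag=1
        if key.startswith("slave") and key[5:].isdigit():
            attrs = dict(pair.split("=", 1) for pair in value.split(","))
            if attrs.get("state") == "online":
                lags[key] = master_offset - int(attrs["offset"])
    return lags


def safe_to_fail_over(info_text: str, max_lag_bytes: int = 1_000_000) -> bool:
    """True if at least one online replica is within the (arbitrary) lag threshold."""
    lags = replica_lag_bytes(info_text)
    return bool(lags) and min(lags.values()) <= max_lag_bytes


# Hypothetical INFO output from a primary whose replica has fallen far behind.
SAMPLE = """\
# Replication
role:master
connected_slaves:1
slave0:ip=10.0.0.2,port=6379,state=online,offset=1000,lag=1
master_repl_offset:5000000
"""

if __name__ == "__main__":
    print(replica_lag_bytes(SAMPLE))
    print("safe to fail over:", safe_to_fail_over(SAMPLE))
```

The point is not the specific threshold but that the "is the secondary actually in sync?" question gets asked by a script rather than assumed by a human under pressure.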

Another example occurred at a company where we relied on proprietary load balancers (which shall remain nameless) and were experiencing significant growth (a good problem to have!). However, we had outgrown the capacity of our current load balancer and needed an upgrade to handle the increased throughput.

We took the straightforward approach: we purchased a larger license, set up the new load balancers, and redirected traffic to them. Within half an hour, our customer service team was inundated with frustrated users who couldn’t access our application. We quickly rolled back to investigate.

To spare you the three months of analysis, we discovered that an upgraded driver on the new load balancer didn’t come with reasonable defaults. While we had increased capacity at the license level, we had inadvertently limited it at the network card level.

A third example: our team had a diligent certificate rotation process (which was beneficial!). We would update certificates well in advance of expiration, allowing ample time to address any issues. One time, we were especially grateful for this foresight. After uploading a new certificate and applying it to the correct endpoint, we were immediately inundated with complaints from users unable to use our iPad application.

Perplexed about the cause, we rolled back and investigated. It turned out that, although the new certificate was issued correctly, we had failed to include all the intermediate certificates in the bundle, which that version of iOS was checking.
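A crude pre-upload check can catch this particular mistake. The sketch below only counts the PEM certificates in a bundle and compares that to how many intermediates you expect; it does not verify signatures, ordering, or issuer links, and both the function names and the default of one intermediate are assumptions for illustration.

```python
import re

# Matches one PEM-encoded certificate block, including its BEGIN/END markers.
PEM_CERT = re.compile(
    r"-----BEGIN CERTIFICATE-----.+?-----END CERTIFICATE-----",
    re.DOTALL,
)


def count_certificates(bundle_text: str) -> int:
    """Count the PEM certificate blocks in a bundle."""
    return len(PEM_CERT.findall(bundle_text))


def check_bundle(bundle_text: str, expected_intermediates: int = 1) -> list[str]:
    """Return a list of problems; an empty list means this crude check passed."""
    problems = []
    found = count_certificates(bundle_text)
    if found == 0:
        problems.append("no PEM certificates found")
    elif found < 1 + expected_intermediates:
        problems.append(
            f"bundle has {found} certificate(s); expected the leaf plus "
            f"{expected_intermediates} intermediate(s)"
        )
    return problems


def fake_cert(label: str) -> str:
    """Build a placeholder PEM block (not a real certificate) for the demo."""
    return f"-----BEGIN CERTIFICATE-----\n{label}\n-----END CERTIFICATE-----\n"


if __name__ == "__main__":
    leaf_only = fake_cert("LEAF")
    full_chain = fake_cert("LEAF") + fake_cert("INTERMEDIATE")
    print(check_bundle(leaf_only))
    print(check_bundle(full_chain))
```

A stronger check would validate the chain from a client that does not fetch missing intermediates on its own, since that is the behavior that bit us.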

Sometimes, these buttons are alarmingly close together. At another company managing numerous databases, we needed to migrate data from one drive type to another. After completing the migration, everything seemed fine, so we proceeded to decommission the old drive for cost efficiency.

The skilled IT administrator assigned to this task opened the Windows drive manager, selected a drive to detach, and executed the action.

And the database crashed.

Unbeknownst to him, he had detached the active volume.
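The safeguard missing here is a sanity check that refuses the action when the target is still in use. Here is a minimal sketch of the idea; the `Volume` type, the `in_use_by` inventory, and the drive letters are hypothetical stand-ins for whatever your platform can actually report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Volume:
    name: str
    # Hypothetical inventory: services that still have files open on this volume.
    in_use_by: tuple[str, ...] = ()


class UnsafeDetach(Exception):
    """Raised when a detach request fails a sanity check."""


def plan_detach(volumes: dict[str, Volume], target: str) -> str:
    """Return the planned action, or raise if the target looks unsafe to detach."""
    if target not in volumes:
        raise UnsafeDetach(f"unknown volume {target!r}; known: {sorted(volumes)}")
    volume = volumes[target]
    if volume.in_use_by:
        users = ", ".join(volume.in_use_by)
        raise UnsafeDetach(f"{target!r} is still in use by {users}")
    return f"detach {target}"


if __name__ == "__main__":
    volumes = {
        "D:": Volume("D:"),                      # old drive, nothing attached
        "E:": Volume("E:", ("database",)),       # new drive, database is live on it
    }
    print(plan_detach(volumes, "D:"))
    try:
        plan_detach(volumes, "E:")
    except UnsafeDetach as err:
        print("refused:", err)
```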

The goal of your operations team should be to distinctly separate the "make it work" button from the "break it all" button.

“But we don't have an ops team; we only do DevOps!” you might argue. You could be correct, but here’s a thought experiment to determine who the de facto leader of your ops team is: Completely shut down your application. Cut off the load balancer, the database—whatever it takes to prevent any user activity. The first technical person to receive a call from the CEO is your ops team leader.

Fortunately, numerous tools can help mitigate the risks associated with these buttons. Here are a few of my favorites:

  1. Automate Whenever Possible

    A key strategy to avoid human error is to automate decisions and actions. If possible, remove the need for human intervention by utilizing scheduled jobs or alarm responses. While you might still need a human to receive alerts, aim for events to self-manage as much as possible.

  2. User-Friendly Command Line Interfaces (CLIs)

    Humans often struggle with remembering lengthy commands. If a human is needed for repetitive tasks that can’t be automated, provide them with a guided tool that includes input validation, sanity checks, and a preview of the planned actions before execution.

  3. Comprehensive Documentation

    For tasks that cannot be automated or guided, ensure you have thorough documentation. This should include:

    • Examples of the situations in which the procedure applies
    • Detailed action steps, possibly with screenshots
    • Validation steps

  4. The Two-Person Rule

    When automation or guidance isn’t feasible, apply the two-person rule. Have someone monitor your actions (in person or via screen sharing) to provide an extra layer of scrutiny before you proceed.

  5. Brian’s "Hands Off the Keyboard" Rule

    This rule doesn’t directly separate the buttons, but it encourages reflection before clicking any confirmatory buttons. Remove your hands from the keyboard or mouse and take a moment to contemplate your choice. This pause allows you and the person observing to potentially halt an impulsive action before it becomes a costly mistake.
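To make the guided-tool and "hands off the keyboard" ideas concrete, here is a minimal sketch of a command-line tool that validates its input, previews the plan by default, and only acts after the operator retypes the resource name. Everything in it is illustrative: the `run` function, the `--execute` flag, and the resource names are assumptions, not a real tool.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Guided decommission tool (illustrative sketch)"
    )
    parser.add_argument("resource", help="name of the resource to decommission")
    parser.add_argument(
        "--execute",
        action="store_true",
        help="actually perform the action; the default is preview only",
    )
    return parser


def run(argv, known_resources, ask=input, out=print) -> bool:
    """Return True only if the action was confirmed and carried out."""
    args = build_parser().parse_args(argv)

    # Input validation: refuse names we do not recognize.
    if args.resource not in known_resources:
        known = ", ".join(sorted(known_resources))
        out(f"Unknown resource {args.resource!r}. Known resources: {known}")
        return False

    # Preview of the planned action before anything happens.
    out(f"PLAN: decommission {args.resource}")
    if not args.execute:
        out("Preview only. Re-run with --execute to apply.")
        return False

    # Forced pause: the operator must retype the name, not just press Enter.
    typed = ask("Type the resource name to confirm: ")
    if typed.strip() != args.resource:
        out("Confirmation did not match. Nothing was changed.")
        return False

    out(f"Decommissioning {args.resource} ...")  # the real work would go here
    return True


if __name__ == "__main__":
    import sys

    run(sys.argv[1:], known_resources={"db-old", "db-new"})
```

Making the safe path (preview) the default and the destructive path the one that takes extra effort is what moves the two buttons apart.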

