What SHOULD Procurement Officials Learn from CrowdStrike?

A recent article over on GovTech titled What Can Procurement Officials Learn from CrowdStrike caught my eye because I wondered if it contained the most important lesson.

The article, whose sub-headline cast CrowdStrike as a useful lesson for officials who draw up government IT contracts, pushing them to ask how future contracts can prepare for unplanned outages, hit on five important points about modern SaaS / cloud-powered technology:

  • additional safeguards are needed in IT contracts
  • even with safeguards, there is still the possibility of a cyberattack, so there must be an immediately actionable disaster response and recovery plan (which vendors must be able to live up to)
  • there should be alternate backup/failover options, even if non-preferred, and that can include paper in the worst case (as far as the doctor is concerned, it’s absurd when a store shuts down in broad daylight because they lost power or internet connectivity to the bank — that’s why we have cash and credit card imprint machines)
  • one should consider specifying liquidated damages up front, to prevent long, drawn-out lawsuits and delayed response times from the third party (who will want to avoid those damages)
  • consider cyber insurance, either on the vendor side or your side

Which is all good advice, but misses the most important point:

NEVER ALLOW A CRITICAL SYSTEM TO BE AUTOMATICALLY UPDATED (en masse)

Now, there’s a reason the military will exactly configure a system designed for single use and LOCK IT DOWN. That’s so it can’t accidentally go down from an unplanned / uncontrolled update when it’s needed most.
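To make "lock it down" a bit more concrete, here is a minimal Python sketch of a baseline-integrity check along those lines. It assumes an approved baseline of file hashes recorded ahead of time; the file path and baseline format are purely illustrative, not any particular vendor's tooling.

    """
    Minimal sketch of a configuration lock-down check, assuming a critical
    terminal whose approved baseline (config files, installed components) was
    recorded in advance. Paths and the baseline format are hypothetical.
    """
    import hashlib
    import json
    from pathlib import Path

    BASELINE_FILE = Path("/etc/critical-system/approved_baseline.json")  # hypothetical path

    def file_digest(path: str) -> str:
        """SHA-256 digest of a locked-down file."""
        return hashlib.sha256(Path(path).read_bytes()).hexdigest()

    def verify_baseline() -> list[str]:
        """Return every file that has drifted from the approved baseline."""
        baseline = json.loads(BASELINE_FILE.read_text())  # {"path": "expected sha256", ...}
        return [
            path for path, expected in baseline.items()
            if not Path(path).exists() or file_digest(path) != expected
        ]

    if __name__ == "__main__":
        drift = verify_baseline()
        if drift:
            # An unapproved change slipped in: refuse to proceed and alert an admin.
            raise SystemExit(f"Baseline drift detected, refusing to run: {drift}")
        print("Configuration matches the approved baseline.")

The point of a check like this is simply that nothing changes on the box unless someone approved it first, which is exactly what an uncontrolled auto-update defeats.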

For example, there’s no way any update, no matter how minor, should be pushed out to a core airline operations terminal without an administrator monitoring the update (which could be on the vendor side IF the vendor maintains a [virtual] configuration that is exactly the same as the customer’s configuration) and ensuring everything works perfectly after the update. And then the update should be propagated to the rest of the terminals in a staged fashion. (Unless you’re dealing with a critical zero-day exploit that could expose financial or personal information, there’s no need for rapid updates; and even then, there should be techs on standby after that test update is complete, just in case something goes wrong and a system has to be immediately rolled back or rebooted.)
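As a rough illustration of that staged approach, here is a Python sketch of a ring-based rollout: one monitored canary terminal gets the update first and must pass a health check before the remaining rings are touched. The ring names and the apply_update(), health_check(), and rollback() helpers are hypothetical stand-ins for whatever tooling the vendor actually provides.

    """
    Minimal sketch of a staged (ring-based) update rollout under the
    assumptions above. All hostnames and helper functions are hypothetical.
    """
    import time

    RINGS = [
        ["canary-terminal-01"],                                      # monitored test machine
        ["ops-terminal-02", "ops-terminal-03"],                      # small pilot group
        ["ops-terminal-04", "ops-terminal-05", "ops-terminal-06"],   # the remainder
    ]

    def apply_update(host: str) -> None:
        """Hypothetical: push the update package to one terminal."""
        print(f"applying update to {host}")

    def health_check(host: str) -> bool:
        """Hypothetical: confirm the terminal boots, logs in, and runs core workflows."""
        print(f"health-checking {host}")
        return True

    def rollback(host: str) -> None:
        """Hypothetical: restore the previous known-good image."""
        print(f"rolling back {host}")

    def staged_rollout(soak_minutes: int = 60) -> None:
        for ring in RINGS:
            for host in ring:
                apply_update(host)
                if not health_check(host):
                    # One failure stops the whole rollout instead of taking down the fleet.
                    rollback(host)
                    raise SystemExit(f"Update failed on {host}; halting rollout.")
            # Let the ring soak, with techs on standby, before touching the next one.
            time.sleep(soak_minutes * 60)

    if __name__ == "__main__":
        staged_rollout(soak_minutes=0)  # zero soak time only for this dry-run example

The design choice that matters is the halt-on-failure behaviour: a bad update stops at one terminal (or one small ring), not at every terminal at once.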

Modern operating system installations, like Windows 11, can have up to 100,000,000 (that’s one hundred million) lines of code, and since you never know where the bugs are, there is no such thing as a low-risk update. Any update has a chance of taking down the OS, or the application you are updating that is integrated with the OS.

But this is not the only critical lesson to take away. The next is:

For critical systems, your provider must maintain backup hot-swap redundant systems!

Once a configuration is confirmed to be bug-free, it must be propagated to the backup, which must have its own redundant data store with all transactions replicated in real time (so that you’d never lose more than a minute or two of updates in an unexpected failure) and which can be hot-swapped in through a simple IP redirection should something catastrophic take down the entire primary system. This backup redundant system must have enough power to run all critical core operations (but not necessarily optional ones like reporting, or tasks that only need to run every two weeks, like payroll) until the primary system can be brought back online. A catastrophic event like a rolling failure from a security or OS update or a cyberattack should be recoverable in minutes simply by re-routing to the failover instance and rebooting all the local machines and/or restarting all the browser sessions.
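To illustrate that failover path, here is a minimal Python sketch of a watchdog that polls the primary and, after a few consecutive failures, repoints a virtual IP / DNS entry at the hot standby that has been receiving the replicated transactions. The health-check URL, standby address, and repoint_virtual_ip() helper are assumptions for illustration; a real deployment would use the load balancer's or DNS provider's own API.

    """
    Minimal sketch of the failover decision described above, under assumed
    endpoints. All addresses and the repoint helper are hypothetical.
    """
    import time
    import urllib.request

    PRIMARY_HEALTH_URL = "https://primary.example.internal/health"   # hypothetical endpoint
    STANDBY_ADDRESS = "10.0.2.10"                                     # hypothetical hot standby
    FAILURES_BEFORE_FAILOVER = 3

    def primary_is_healthy(timeout: float = 5.0) -> bool:
        """Poll the primary's health endpoint; any network error counts as unhealthy."""
        try:
            with urllib.request.urlopen(PRIMARY_HEALTH_URL, timeout=timeout) as resp:
                return resp.status == 200
        except OSError:
            return False

    def repoint_virtual_ip(new_address: str) -> None:
        """Hypothetical: update the shared virtual IP / DNS record to the standby."""
        print(f"re-routing critical traffic to {new_address}")

    def watchdog(poll_seconds: int = 30) -> None:
        consecutive_failures = 0
        while True:
            if primary_is_healthy():
                consecutive_failures = 0
            else:
                consecutive_failures += 1
                if consecutive_failures >= FAILURES_BEFORE_FAILOVER:
                    # Primary is down: cut over to the replicated standby in minutes,
                    # then have local machines reboot / restart their browser sessions.
                    repoint_virtual_ip(STANDBY_ADDRESS)
                    return
            time.sleep(poll_seconds)

    if __name__ == "__main__":
        watchdog()

Requiring several consecutive failures before cutting over is the usual compromise: it keeps a single dropped poll from triggering an unnecessary swap, while still getting you onto the standby within a few minutes of a real outage.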

Those are the lessons. If a system is so critical you cannot operate at all without it, you must have redundancy and a failover plan that can bring you back online within an hour, max.