Why an ounce of software prevention is worth a thousand apologies


It’s not that I like to pick on Apple. Heck, I own no fewer than eight iOS devices, not counting the two I rely on to get my work done, and they work great almost all the time. But Apple’s history with software rollouts often begs for attention (See precedents here, here and here).

This time, it was the simultaneous iOS 10 and macOS Sierra rollouts that gave Apple and its users fits. The company called it the “biggest iOS release ever,” and it apparently flew like a brick. Or rather, it “bricked” many users’ devices, and angry users were quick to say so on Twitter.

The website bgr.com reported that there didn’t appear to be any logic to which devices bricked and why:

“In case you’re wondering, this is why you should always make sure your device is backed up before performing a vital software update. Until Apple works out this bug, you might want to hold off on updating, or at least use the update procedure through a PC or Mac to avoid bricking your device.”

Apple released a statement saying the problem was a “brief issue.” “We apologize to those customers. Anyone who was affected should connect to iTunes to complete the update or contact AppleCare for help,” the company said.

Apple recommends backing up your device before updating the software. No duh. Here’s another no-brainer: rigorously test your software updates before you roll them out. Service Virtualization can facilitate a virtually unlimited number of use-case tests for exactly this kind of scenario. There’s little excuse for shipping routine upgrades that don’t work with your own products.
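
What does that look like in practice? At its simplest, it means standing a virtual service in for the real back end so you can rehearse every device model and failure path before a rollout ever reaches customers. Here’s a minimal sketch in Python — the UpdateClient class, the fetch_manifest call and the canned responses are all hypothetical, invented for illustration rather than drawn from any vendor’s actual update pipeline:

```python
import unittest
from unittest import mock

# Everything below is hypothetical and for illustration only: "UpdateClient",
# "fetch_manifest" and the canned responses are invented stand-ins, not any
# vendor's real over-the-air update pipeline.

class UpdateClient:
    """Minimal OTA update client with its transport injected, so tests can
    substitute a virtualized service for the real update server."""

    def __init__(self, transport):
        self.transport = transport

    def apply_update(self, device_model):
        manifest = self.transport.fetch_manifest(device_model)
        if manifest is None:
            # No manifest for this model: skip safely rather than brick the device.
            return "skipped"
        if not manifest.get("signature_ok", False):
            # Corrupt or unsigned payload: refuse to install.
            return "rejected"
        return "installed"


class UpdateClientTests(unittest.TestCase):
    def test_unknown_model_is_skipped_not_bricked(self):
        transport = mock.Mock()
        transport.fetch_manifest.return_value = None  # virtualized server: no manifest
        self.assertEqual(UpdateClient(transport).apply_update("made-up-model"), "skipped")

    def test_bad_signature_is_rejected(self):
        transport = mock.Mock()
        transport.fetch_manifest.return_value = {"signature_ok": False}
        self.assertEqual(UpdateClient(transport).apply_update("made-up-model"), "rejected")

    def test_good_manifest_installs(self):
        transport = mock.Mock()
        transport.fetch_manifest.return_value = {"signature_ok": True}
        self.assertEqual(UpdateClient(transport).apply_update("made-up-model"), "installed")


if __name__ == "__main__":
    unittest.main()
```

The specific checks don’t matter; the point is that a virtualized transport lets you replay the “unknown device” and “bad payload” paths thousands of times, for every model you ship, without ever touching a production server.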

But Apple is far from the only transgressor in this category.

It’s not just Apple: Samsung is bitten by the bad code bug, too

According to Chris Williams over at the UK’s Register, Samsung has recently joined Apple in the software line of shame. The company’s Galaxy Note 7 (the same one with the exploding batteries) is among the models also suffering from Android platform crashes.

Wrote Williams:

“And its whizzy Exynos 8890 processor, which powers the Note 7 and the Galaxy S7 and S7 Edge, is tripping up apps with seemingly bizarre crashes – from null pointers to illegal instruction exceptions, all triggering randomly. It’s an issue that has stumped engineers for months.”

Finally, he wrote, engineers have figured out what happened. You should go read his extremely detailed explanation, but here’s the gist: illegal instruction errors were tripping up apps built with Mono, as well as game emulators for the GameCube and PSP.

Seems like a problem that might easily have been sussed out with a few more tests, eh, Samsung?

Lousy software doomed a multi-million-pound drone

The crash of a $2.2 billion British Army drone aircraft in Wales two years ago has been blamed, in part, on “unfit computer software,” according to a new government report. The drone was being flown by civilian contractors when it went down at Aberporth airfield in Ceredigion.

In his inquiry report, Air Marshal Dick Garwood said there “are a number of technical issues that will need to be resolved” before the drone program can be resumed, including “deficiencies to the [drone’s] landing software logic.” Human error was also among the 15 contributing factors identified in the crash, according to Wales Online.

Google apologizes after software update glitch crashes cloud

Google issued an apology late last month after a software update brought its cloud operations crashing down for users in the U.S. Central region. According to The Stack, the outage lasted nearly two hours after the update “was carried out on traffic routers while data was being moved between data centres.” Google noted on its status page:

“As part of this procedure, we first move a proportion of apps to a new datacenter in which capacity has already been provisioned. We then gracefully drain traffic from an equivalent proportion of servers in the downsized datacenter in order to reclaim resources. The applications running on the drained servers are automatically rescheduled onto different servers. During this procedure, a software update on the traffic routers was also in progress, and this update triggered a rolling restart of the traffic routers. This temporarily diminished the available router capacity.”
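
In plain terms, two capacity-reducing events overlapped: a planned drain of servers in the downsized datacenter and a rolling restart of the traffic routers triggered by the software update. Even a crude headroom model, run before the change window, makes that overlap visible. Here’s a hedged, back-of-the-envelope sketch in Python — the router counts, restart batch size and demand figures are invented for illustration and bear no relation to Google’s actual fleet or traffic:

```python
# A back-of-the-envelope check for the kind of capacity overlap described above.
# Every number here is hypothetical -- none of it reflects Google's actual
# fleet sizes, procedures or traffic levels.

def routing_headroom(total_routers, routers_restarting, demand_fraction):
    """Spare routing capacity (as a fraction of the full fleet) during a rolling restart.

    Assumes each router carries an equal share of traffic and that a router
    that is mid-restart serves nothing until it comes back.
    """
    available = (total_routers - routers_restarting) / total_routers
    return available - demand_fraction


if __name__ == "__main__":
    # Normal day: full fleet of 100 routers, demand at 70% of total capacity.
    print(round(routing_headroom(100, 0, 0.70), 2))   # 0.3  -> comfortable

    # Rolling restart taking 25 routers out at a time, demand unchanged.
    print(round(routing_headroom(100, 25, 0.70), 2))  # 0.05 -> thin margin

    # Same restart while the datacenter migration pushes extra traffic through.
    print(round(routing_headroom(100, 25, 0.80), 2))  # -0.05 -> overload, errors
```

The moment the drain and the restart are modeled together rather than separately, the negative headroom — and the elevated error rates that follow — shows up in a one-line calculation instead of a post-mortem.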

The outage pushed error rates above 10 percent for 21 percent of apps running on Google App Engine.

This seems like an overload issue that could have easily been predicted had engineers modeled it using simulation tools such as Service Virtualization, but we’ll leave that to the experts at Google to figure out. For now, they say only that they’re sorry:

“We know that you rely on our infrastructure to run your important workloads and that this incident does not meet our bar for reliability. For that, we apologize.”

Don’t feel bad, Google. Many companies have been down this road. Remember, an ounce of prevention is worth a thousand apologies. Don’t be the company that forgets that lesson.