
Season of Forgiveness and Healing

by Administrator on September 15, 2010

This headline probably stands out as a non sequitur given the news generated in the current electoral season and painful economy.  That said, I have noted that coverage of the State of Virginia’s recent storage disaster has been treated in just such a tone by most of the press covering the incident.  Here is the “timeline” of the incident as reported by AP:

Around noon on Aug. 25, a data storage unit roughly the size of eight refrigerators in a data center in Chester just south of Richmond sent a message indicating that something wasn’t right. An analysis showed that one of the two memory boards on the machine needed replacement.

A few hours later, a technician replaced the board and the moment it was put back into service “that’s when the real outage began,” said Sam Nixon, head of the Virginia Information Technologies Agency overseeing the contract with Northrop Grumman Corp., which provides the State of Virginia’s IT services. 

Officials now believe the memory board that was not replaced may have been the one that was faulty. The machine is supposed to recover on its own from such problems, but that didn’t happen in this case.

Workers continued to try to fix the problem that night, but were unsuccessful. A decision was made the next day to shut down the entire storage system overnight and replace all the internal components. The system was brought back online at 2:30 a.m. Aug. 27.

Workers discovered that the failure had corrupted many of the backup databases, so they had to recover the data from magnetic tape that must be spun up and loaded into the computer, a tedious process.

Nixon likened it to knocking over a file cabinet, spilling papers from files all over the floor. All the data is still there, but “we can’t find what we’re looking for,” he said.

The AP account  went on to report:

Northrop Grumman vice president Sam Abbate, who oversees the company’s $2.4 billion, 10-year contract to provide Virginia’s computer services, told the General Assembly’s investigative arm Monday that the company regretted the disruption, which hampered 26 of 89 state agencies. Some services, like getting drivers licenses and paying taxes, were unavailable for a week beginning Aug. 25.

Nixon said his agency is focused on figuring out how it could have improved recovery times.

“The overall outage itself – the hardware failure itself – was unacceptable,” Nixon said. “The amount of time it took to restore agencies to an operational speed was unacceptable.”

While 97 percent of the data was recovered, some remains lost. The Department of Motor Vehicles – one of the hardest-hit agencies – lost thousands of pictures and signatures from those trying to renew or obtain driver’s licenses in the four days preceding the failure.

A Minnesota company is trying to recover thousands of those files, but officials believe at least 4,200 pictures and signatures saved on Aug. 25 appear to be unrecoverable.

Okay, so that’s the best coverage we have of the event:  a dumbed-down exchange between pols and contractors about what caused the disaster and what was being done to assure that similar disasters would be avoided and/or recovered from more effectively in the future.  No muss, no fuss — though Northrop Grumman is paying $250K for an independent review and at least $100K in fines and penalties (kind of small potatoes given the value of the 10-year outsourcing contract – $2.4 B!)

Case closed.  Let’s put it behind us and move on.  This is the season of forgiveness and healing, right?

Not for me, of course, bad boy that I am.

Not mentioned in the article — anywhere — is that the “refrigerator storage” is EMC.  A memory core meltdown is what the writer is trying to describe in popular vernacular.  This raises some interesting issues in itself.

EMC touts memory cache redundancies as a value add of its equipment, helping to justify its high cost by assuring the customer of performance and resiliency delivered by the configuration.  Hmm. 

That the wrong card was replaced to correct the meltdown goes further to the issue of how well EMC gear monitors its own operation and accurately reports its faults.  Apparently there are no indicator lights that tell support engineers which boards are bad, no test rigs to enable field testing, and no software functionality in EMC’s onboard element manager to log errors for proper troubleshooting and break/fix.  That would seem to be a huge problem with this gear, increasing the propensity for disastrous interruption events.
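To make the point concrete: the minimum you would expect from any array’s element manager is a per-board error log that tells the tech which card to pull. Here is a toy sketch in Python of that idea; the class and method names are entirely hypothetical and are not a description of EMC’s actual tooling:

```python
from collections import Counter

class BoardFaultMonitor:
    """Toy element-manager fault log: count error reports per memory
    board so the field tech replaces the right card, not its twin."""

    def __init__(self):
        self.errors = Counter()

    def report(self, board_id):
        # Each correctable-error event increments the board's tally.
        self.errors[board_id] += 1

    def suspect(self):
        # The board with the most logged errors is the replacement candidate.
        if not self.errors:
            return None
        return self.errors.most_common(1)[0][0]

mon = BoardFaultMonitor()
for board in ["board-A", "board-B", "board-A", "board-A"]:
    mon.report(board)
# mon.suspect() now points at the board that actually erred
```

Nothing exotic: just a counter and a query. The point is that with even this much logging, “we replaced the wrong board” should be a hard mistake to make.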

On the other hand, it may be the fault of the support engineer who doesn’t know his job or how to use the tools provided on the rig to troubleshoot and correct errors.  That would also go to EMC’s much touted claims that its service and support are worth every dime of the hugely overpriced warranty and maintenance contract that it sells with every box.

Also, for mission critical database workload, EMC usually stuffs the consumer with Point in Time Mirror Splitting software and synchronous and asynchronous SRDF software (together with three copies of its rig to form a multi-hop mirroring HA configuration). 

The intent of the former (PIT mirroring) is to make copies of database states at periodic intervals so a database can be recovered expediently to a previous state should an error occur.  Was PIT mirroring being used in this case? 
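For readers unfamiliar with the technique, here is a toy sketch of what PIT mirror splitting buys you: periodic frozen copies that let you roll a corrupted database back to a known-good state without going to tape. The `PITMirror` class below is a hypothetical stand-in, not EMC’s software:

```python
import time
from collections import deque

class PITMirror:
    """Toy model of point-in-time mirror splitting: periodically 'split'
    a copy of the database state and retain the last N copies so a
    corrupted primary can be rolled back to a known-good state."""

    def __init__(self, retain=3):
        self.snapshots = deque(maxlen=retain)  # oldest copies age out

    def split(self, db_state):
        # Freeze a timestamped copy of the current state.
        self.snapshots.append((time.time(), dict(db_state)))

    def recover(self):
        # Roll back to the most recent split; fail loudly if none exist.
        if not self.snapshots:
            raise RuntimeError("no point-in-time copy available")
        return self.snapshots[-1][1]

db = {"row1": "a"}
pit = PITMirror(retain=3)
pit.split(db)            # known-good copy taken before the fault
db["row1"] = "corrupt"   # simulated corruption after the board swap
db = pit.recover()       # expedient recovery without touching tape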

The point of multi-hop mirroring, in addition to enabling EMC to sell a couple million dollars of redundant hardware and a quarter million or more in SRDF software licenses, is to ensure that primary array continuity volumes are replicated in near real time to a second stand of disks nearby, then remotely to a third stand of disk on an asynch basis — a hedge against the primary machine failure and against a total facility failure.  Was there a redundant configuration in use by Northrop?  (If not, EMC’s sales droids weren’t doing their jobs!)
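Here is a toy model of the multi-hop arrangement just described: a synchronous local hop (the write is not complete until both arrays have it) plus a deferred asynchronous remote hop. All names are hypothetical, and real SRDF operates at the block level inside the array rather than in host software; this is only a sketch of the data flow:

```python
class Array:
    """Toy disk array: just a name and a dict of blocks."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}

class MultiHopMirror:
    """Toy multi-hop replication: writes land on the primary and a local
    mirror synchronously; a third, remote array catches up from a queue
    asynchronously (the 'asynch' hop)."""

    def __init__(self):
        self.primary = Array("primary")
        self.local = Array("local-sync")
        self.remote = Array("remote-async")
        self.queue = []  # pending updates for the remote hop

    def write(self, key, value):
        # Synchronous hop: both local arrays get the write before it is
        # acknowledged; the remote hop is merely queued.
        self.primary.blocks[key] = value
        self.local.blocks[key] = value
        self.queue.append((key, value))

    def drain(self):
        # Asynchronous hop: flush queued writes to the remote array.
        while self.queue:
            key, value = self.queue.pop(0)
            self.remote.blocks[key] = value

m = MultiHopMirror()
m.write("lba0", b"data")
# the local mirror is current even before the async hop runs
m.drain()
```

If a configuration shaped like this was deployed in Chester, the loss of a primary array should have been a failover event, not a week-long outage.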

The article also disses tape backup and restore.  I find this rich.  They successfully recovered 97 frackin percent of the data on the arrays, for goodness sake.  Tape was the hero, but it is cast as the villain.  This is pure marketecture, deployed to cover for the line advanced loudly by EMC that tape is dead and that its own gear is so resilient you don’t need that tape stuff anymore.

If they are claiming, as is suggested, that tape restores take too long or exceed recovery time objectives, perhaps money would have been better spent on improving the tape environment and its configuration than buying all of the overpriced spinning rust from EMC in the first place!
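To put numbers on that trade-off, here is a back-of-the-envelope restore-time calculation showing how much of the recovery window is governed by drive count, streaming throughput, and mount overhead. Every figure below is an illustrative assumption, not the actual Virginia configuration:

```python
def restore_hours(data_tb, drives, mb_per_sec, mount_overhead_min, tapes):
    """Back-of-the-envelope restore time: streaming time spread across
    parallel drives, plus per-tape mount/seek overhead. Decimal units
    (1 TB = 1,000,000 MB). All figures are illustrative."""
    stream_sec = (data_tb * 1_000_000) / (mb_per_sec * drives)
    mount_sec = (tapes / drives) * mount_overhead_min * 60
    return (stream_sec + mount_sec) / 3600

# A starved, older library: 2 drives at 80 MB/s, slow robotics
slow = restore_hours(data_tb=10, drives=2, mb_per_sec=80,
                     mount_overhead_min=5, tapes=25)

# The same 10 TB with modest reinvestment: 8 drives at 160 MB/s
fast = restore_hours(data_tb=10, drives=8, mb_per_sec=160,
                     mount_overhead_min=3, tapes=25)
```

The arithmetic is the point: parallelism and throughput shrink the restore window by roughly an order of magnitude. A fraction of the array budget spent on the tape environment pays for exactly that.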

This is a classic example of a potentially useful case study being dumbed down to avoid market ramifications. All of EMC’s brochures and PowerPoints notwithstanding, their resiliency claims should be challenged by what happened in Virginia.  Everyone who has their gear, who drank their Kool-Aid, should be a bit more nervous today about their choice.  Everyone who bought their story that tape is dead should be reconsidering that position.

I also think that Northrop should be a bit more transparent about the facts of the case.  That won’t happen, of course.  From my days working with a Beltway Bandit, I know that the key objective is to put this in the rear-view mirror so as not to spoil any other contracts in the offing.  I wonder whether the firm is also weighing its ability to place blame where it is due against the gag order in its EMC contracts prohibiting customers from speaking publicly about the performance it obtains from EMC gear.

As a value add, I want to invite all vendor readers to suggest how they would have improved on this situation with their wares.  Specifically,

  1. If you peddle gear, how do you isolate equipment faults and notify operators so that corrective action can be taken quickly and efficiently?
  2. If you peddle resiliency, what functionality in your software or hardware would have prevented the loss of 3 percent of the data from Virginia’s databases?
  3. If you peddle data protection, how would your wares have expedited data restore in this case?
  4. If you market superior service and support, how would your service technicians have facilitated recovery?
  5. If you sell solutions, what configuration would you recommend to NG on behalf of Virginia and its other customers that would beat EMC on a price, performance and resiliency basis?

I will publish all responses here and will take the best ones and publish them as an article in the trade press.

By the way, if the Minnesota-based data recovery shop being used to recover data off of the failed EMC rigs is On Track, why isn’t credit being given to those guys?  They do great work and have saved more butts from the fire than just about anyone in this industry.

Bottom Line:  Where I grew up, forgiveness wasn’t awarded until folks owned up to their transgression and promised to avoid the same misdeed in the future.  Healing began after debts had been paid.

Nobody gets a pass here!

{ 4 comments… read them below or add one }

LeRoy Budnik September 16, 2010 at 12:29 am

After reading Jon’s notes on the Virginia outage, for some reason the proverb “Time and tide wait for no man” came to mind. Checking the proverb’s origin, it suggests that no man, no matter how powerful, can give orders to the sea or to time. Similarly, no product or risk mitigation strategy, no matter how expensive or layered, can reduce risk from any source to zero. I have had to unwind similar problems. Yes, the vendor and outsourcer have responsibility, but there are so many other layers.

The Virginia scenario points to a growing, unspoken problem: mass consolidation increases risk. Why were 30% of the state agencies dependent on a single array? Did those responsible for risk assessment contemplate the dependencies? Was the scenario in their worksheets? If it was, what was the assigned probability? In their mitigation strategy, what was the terminal probability (post implementation of the mitigation strategy)?

Big companies and governments have formal, annual, third party risk assessments and formal audits. On the risk side, was the risk assessment vendor unaware? The largest provider of these assessments (90+%) is a company named Marsh & McLennan. If it was Marsh, were they unaware? Here is the rub: Kroll On Track, the recovery company, is an operating unit of Marsh. Meanwhile, we have gone so crazy with consolidation to maximize utilization and defer cost, etc., that we do not factor in risk. I picked on the company who should know.

Then there is Northrop, the service provider, who accepted the design from an OEM, with the sub-intent of perceived savings to maximize contract profitability. Finally, there is the vendor, who has so engrained the message that they can no longer see new risk profiles as they emerge (note the word “new”). I could put names to this on so many events. It is so pervasive and not limited to storage, for example the client with 90 VMs on a server – what are they thinking? We have to pick on everyone in the chain for not recognizing dependency, risk and business impact.

Now, consider the event. Everything breaks. We use redundancy to reduce risk. The reported cause is human error. The single-factor failure, bad memory, probably left the machine in a recoverable state, except for a secondary event, human error. The traditional protocols would seek to reduce risk and, if possible, not make the change during business hours – although this is serious, it probably could have waited. With time, they could substitute risk: break remote links, flush cache, block or freeze replicas, etc. Most would have remote support guide the operation; this would include setting a service LED on the board they want replaced. Now the question: did the onsite tech pull the wrong card, or did the remote tech mark the wrong card?
Let us compare the old model of break-fix with the new (before going further).

In the old support model, you had both a remote and an onsite support person with near-equal training and experience – this is the model still in the minds of customers. In this scenario, the onsite tech would check errors, etc., before calling in, being set up on the management laptop. They would call in, a remote person would do something similar, and they would act “together,” although remote support could give orders. If you took actions on your own that lost data, you lost your job. It was something like the two keys and code sets for a launch silo.

Now, the new model: to reduce cost, an outsourced break-fix model is in place, even for the big arrays. One such company providing onsite techs is Unisys. The techs I have known are good people, work on many products, and generally are strong in the common-sense department. We are grateful to them, because without them many companies could not offer product support. However, the console is not their thing. So the rules of engagement for them are to “do as they are told” – essentially, a transfer of responsibility. Fifteen years ago, we called this the deskilling of the workforce. The real meaning is, there is only one key. Bottom line: the traditional risk profile is in need of update to reflect an increase in error probability, given a single human making the decision (even if there are several people in the remote support center who may or may not look over each other’s shoulders). The best employees of the storage company are tasked with product deployment and often cannot provide service after implementation. In many cases on the service side, it does not matter if the person was an employee or outsourced, working within a risk-transfer environment, because they are de-skilled.

We all know that pulling a live card in a machine doing local and remote replication will cascade errors. It was preventable. However, once the error was recognized, one cannot wait until the night of the next day to act. Once recognized, was the client representative notified? Was there a policy to go to an emergency shutdown? Could the client service manager make that call? And could the manager make the call quickly?

As a minor deviation, how much redundancy is the public willing to pay for? Will someone die if recovery is slow? Everyone has to make tough choices. Maybe a week is ok, as long as there is a protocol for notification.

Well, if I go any further, the State of Virginia should pay me to write the report.

Administrator September 16, 2010 at 8:05 am

Hi LeRoy. Thanks for chiming in. Several others have done so through email. Seems like there are many war stories out there regarding EMC configuration failures.

Here is a bit more background reading on the event itself as covered in the trades…



Administrator September 16, 2010 at 8:21 am

Back to your philosophical observations, LeRoy. I agree with you up to a point: disasters can’t be completely prevented, of course, and risk is often a function of trade-offs. No one is suggesting otherwise.

What galls me, frankly, is the banal reassurance many vendors provide that their stuff is bulletproof and that wrap-around service by hugely qualified tech support personnel is an additional guarantor of resiliency — all of which justifies a giant mark-up on the price demanded both for the gear and for the service/support contract. It appears that this DMX3 and its support were subpar.

Also, I have rarely seen an instance in which EMC did not collaborate on the installation and configuration of its gear. I would not be so quick to let them off the hook just because “shit happens.”

LeRoy Budnik September 17, 2010 at 3:57 pm

I agree, and have the same sour taste for the arguments.

In Harry Potter, one of the courses at Hogwarts was “defense against the dark arts.” Clients have to make a choice: will they defend against the dark arts of marketing, or blindly accept the spells of reassurance and counter-argument? You and I teach that course.

The most important messages include:

1) Consolidation increases risk and consequence
2) Changing service delivery models can (and does) increase risk
3) Previous control strategies may be insufficient and subject to new failure scenarios

Ranting is ok (it is your blog), but it does not fix process. EMC is absolutely on the hook. However, it is hard to let the others off the hook who, in their role of protecting the client, accepted the rhetoric and should have known better. They started by choosing risk transfer to a vendor illusion and stopped thinking of other risk scenarios. During the event, which will be hard to unwind, everyone accepted some wild choices, like changing all of the cards. Meanwhile, in such a complex box, memory is the one thing that all of these queues for local and remote replicas, etc., feed through. Is it possible that corruption occurred at card failure? Were database procedures followed, host buffers flushed, replicas in process completed and split, RDF queues drained, and links suspended before replacing the card as a risk reduction strategy (to preserve the bread-crumbs of restore)? Apparently not!

Maybe it is a good thing that this failure occurred. Perhaps now people will learn to see through the vendor illusion. But wait: no one ever does until it happens to them.
