Ask A Question, Get a Million Answers
My clients continue to be concerned about the compliance/governance impact of de-duplication. I wanted to know what the governance crowd was thinking, so I asked the experts over at Dan Swanson’s most excellent message board at Yahoo Groups. Dan has in his cadre of regulars some of the top dogs in top companies responsible for compliance and governance planning for their firms. Thought it would be the right place to look.
Now, I’m not so sure. First, the moderator said he wasn’t sure if the thread was right for the group, but he would allow a group vote. Fair enough. My response:
I defer to the group, of course, and can take this discussion elsewhere if desired. But if this group is not the right one to answer a question so germane to the practice and intention of information governance, then which one is?
Vendors just want to sell stuff. I don’t see them volunteering their legal folks to stand between the customer and the regulator if the technology they have sold is putting information governance policies into jeopardy.
Moreover, the Front Office doesn’t understand the technologies that are being applied to meet one goal (IT cost containment, CAPEX reduction) while putting another (compliance/governance) at potential risk. I thought audit/governance’s job was to be their eyes and ears?
I am not interested in a drill down into the rarified details of de-dupe, just looking for a judicious opinion of whether companies that use it are placing themselves at risk of noncompliance.
Let me know, Dan, whether I should take this elsewhere.
I read this message thread daily and learn a great deal. However, many of the points discussed here strike me as philosophical rather than pragmatic. Screwing with data is where the rubber meets the proverbial road. I find myself asking how can anyone involved in audit, risk, governance do their job properly if they don’t consider the technology issues that are in their faces today?
I got a lot of feedback from my post stating the issues as I understood them, and most of it was very encouraging. One fellow wrote,
From my point of view, Jon’s question and thread fit perfectly in this GOV DG …
And I personally find this thread more compelling (and practical) than a lot of recent threads, too …….
Someone else chimed in:
Seems like a reasonable thread to me. It isn’t one that interests me, since I’m not a practitioner in this particular area, but e-mails are very easy to delete… or to file for future reference.
And another:
I believe effective governance includes ensuring compliance with applicable laws and regulations. This issue seems to be one that we could discuss on that basis. Maybe our lawyers (like Jack) can comment on the regulations and their implications for the organization, and some of our IT security experts can weigh in on how we achieve practical compliance.
So things started out fine. Then, they got confusing.
One fellow dismissed the issue as “a records management issue, not a governance issue.”
Another decided, I guess, to lay some legal groundwork, citing cases where information management practices did become germane to a lawsuit.
Yes, information protection is a governance issue. Here are two cautionary tales on this point:
1. Cobell v. Kempthorne, fka Cobell v. Norton. This was a seemingly endless class action by US Indian tribes against the Interior Department for mismanagement of tribal assets the Bureau of Indian Affairs had a fiduciary obligation to manage in the tribal members’ interest. The asset mismanagement case rapidly became sidetracked into an *information* mismanagement case: the ID had implemented its information management systems and processes so badly that the court concluded (this January) that it wasn’t even possible to account for what the class members might be owed. The information management systems and processes pieces of the litigation included appointment of a special master to independently investigate, penetration testing, etc. – even a court order requiring the ID to take the systems off the Internet due to their lack of basic security.
Now, imagine a comparable scenario applied to a private enterprise, where the organization lost control of mission-critical information due to a failure to provide proper information protection oversight . . . but wait, there’s more!
2. Oxford Health Plans. Several years ago – six? eight? – OHP decided to upgrade its claims payment systems. Since OHP is a health insurer these are mission critical systems and information. Unfortunately they lost control in the middle of the process and lost the ability to accurately pay claims because they couldn’t accurately identify and process the necessary information, so they paid providers based on estimates and past history. The fiduciary officers and the external auditors also failed to disclose this problem very clearly. Upshot: Derivative and class actions against OHP, officers and auditors – settlements (my recollection is) on the order of $300 million.
Definitely precedent suggesting strongly to me that information protection is a governance issue.
Another fellow cited his own research,
Jon,
That’s an interesting question.
About 4 years ago I did a paper on “Protecting Yourself Against E-illiteracy: Avoid Being Duped” and it made its way up to the front lines; I’m happy to say that a US Magistrate Judge said (off the record / out of court discussion) that the paper takes an interesting approach - would be prime for discussions and input at the “meet and confer session” before e-discovery begins under the FRCP.
Those who talk about de-duplication are often the ones talking out of both sides of their mouths in litigation… they want the opponent (or someone) to believe that de-duplication means “like or very similar,” and it does not. Those are the same people who could not truthfully explain an MD5 hash if their incomes depended on it.
Now about my paper… after writing it, having it posted on the internet, and being criticized by an attorney who, shall we say, produces an annual industry study for money from those that he monitors for best practices… the paper was well received, and then the FTC announced that companies under investigation could not use “de-duplication” practices before first clearing it with both the FTC’s counsel and IT group… that was most pleasurable for me.
I suggest most of you “lose” the phrase “their best judgment and reference their own risk management profile.” Those who use it and, more importantly, those who read or hear it rely on it to mean something they don’t understand; it is simply what the lawyers told them to say once upon a time. The regulators (as the FTC did) and the courts will eventually wake up to the intentional “spoliation” of electronic information. This is from a yet-to-be-released paper (I’m a co-author) that discusses the topic of spoliation: “While spoliation may include both negligent and intentional destruction or withholding of evidence, the courts generally presume that parties only destroy evidence that is harmful to their case, following the legal maxim omnia praesumuntur contra spoliatorem, ‘all things are presumed against the spoliator.’ CPAs face severe court sanctions if found guilty of spoliation, i.e., helping to hide or delete discoverable ESI during threatened or actual litigation.”
Don’t mean to blow my own horn here, but I’m happy to answer specific questions if you or the group have any, or to send the paper mentioned above to anyone who wants it.
You’re right about “how can anyone involved in audit, risk, governance do their job properly if they don’t consider the technology issues that are in their faces today?” Then again, those same people are never the ones to be terminated themselves, right? I have been saying that for too many years.
Risk avoidance and good corporate governance demand an understanding of technologies. Then again, the auditors of Enron, and of every other company that we read about (and will continue to read about to a greater extent as a result of insolvencies and bankruptcies), felt they were doing a good job because they were paid well for not speaking up to begin with.
The technologies cannot alter the data or decide near duplicates or exactly who received what and when or eliminate the earlier versions (they are often more truthful than the final version) and that includes databases. You want to find the crooks take a look at the corrupt databases for starters.
Another guy answered that removing duplicates of files wasn’t a problem, so long as an original copy remained. (His interpretation of dedupe is strictly file level.)
Today, I posted this follow-up on Dan’s email reflector. I hope to share what I learn with readers here as responses come in.
Thanks to the group for the early support.
Here are a few perceptions to clear up right away.
First, deduplication is a marketing term, not a technical term. So, there is bound to be some slippage and confusion here. There are two categories of the technology right off the bat:
1. File-level dedupe involves the deletion of replicated files per policy and the maintenance of a single file image (the original) and perhaps a backup and an archive copy. This is NOT what we are talking about in this question.
2. Block level de-duplication is the other approach. In its simplest form, it involves the substitution of a pointer or stub after data has been removed from a file. Look for a pattern of 1s and 0s across a broad range of files, pull it out and insert a stub. (Re-inflating the file involves replacement of the stub with the original 1s and 0s.)
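To make the stub idea concrete, here is a minimal sketch in Python. It assumes fixed-size blocks and SHA-256 fingerprints, which are my illustrative choices and not any vendor's actual algorithm; the names `BlockStore` and `BLOCK_SIZE` are hypothetical.

```python
import hashlib

BLOCK_SIZE = 8  # hypothetical; chosen tiny so the example is easy to follow


class BlockStore:
    """Toy block-level deduplicating store (illustrative only)."""

    def __init__(self):
        self.blocks = {}   # fingerprint -> block bytes, each pattern stored once
        self.recipes = {}  # file name -> ordered list of fingerprints (the "stubs")

    def write(self, name, data):
        recipe = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            fp = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(fp, block)  # keep only the first copy of a pattern
            recipe.append(fp)
        self.recipes[name] = recipe

    def read(self, name):
        """Re-inflate: replace each stub with the block it points to."""
        return b"".join(self.blocks[fp] for fp in self.recipes[name])

    def stored_bytes(self):
        return sum(len(b) for b in self.blocks.values())


store = BlockStore()
file_a = b"AAAAAAAABBBBBBBBCCCCCCCC"
file_b = b"AAAAAAAABBBBBBBBDDDDDDDD"
store.write("a.doc", file_a)
store.write("b.doc", file_b)
```

The two files share their first two blocks, so the store holds four unique blocks where six were written. Reading either file walks its list of stubs and reassembles the original bytes.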
Block level deduplication is at issue here. Every vendor has its own algorithm which it regards as secret sauce. There are no standards. Plus, there are at least two ways to implement it. One is called inline, because data is deduplicated as it streams to the destination disk drive. The other is called “post processing” because data is first written to disk, then the deduplication algorithm is applied. The former method is criticized by advocates of the latter as potentially damaging to data because of the speed and decision-less nature with which it is applied (with post processing, you can make a few decisions about what files to de-dupe and what files not to de-dupe; with in-line, everything traversing the wire is de-duped).
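The difference between the two implementations can be sketched as follows. This is a toy model under my own assumptions (fixed-size blocks, SHA-256 fingerprints, and a hypothetical `exclude` policy function); it does not describe how any particular product works.

```python
import hashlib

BLOCK = 4  # hypothetical block size


def dedupe(data, pool):
    """Split data into blocks, pool the unique ones, return the list of stubs."""
    recipe = []
    for i in range(0, len(data), BLOCK):
        block = data[i:i + BLOCK]
        fp = hashlib.sha256(block).hexdigest()
        pool.setdefault(fp, block)
        recipe.append(fp)
    return recipe


def inline_write(files, pool):
    """Inline: everything is deduplicated as it streams in; no raw copy lands first."""
    return {name: dedupe(data, pool) for name, data in files.items()}


def post_process(landed, pool, exclude):
    """Post-process: files are already on disk in raw form; a later pass
    deduplicates only the files the policy allows."""
    result = {}
    for name, data in landed.items():
        if exclude(name):
            result[name] = data  # left exactly as written
        else:
            result[name] = dedupe(data, pool)
    return result


files = {"ledger.db": b"ABCDABCD", "notes.txt": b"ABCDWXYZ"}

inline_result = inline_write(files, {})
post_result = post_process(files, {}, exclude=lambda name: name.endswith(".db"))
```

In the inline case both files end up as lists of stubs. In the post-process case the policy keeps `ledger.db` as raw bytes while `notes.txt` is deduplicated, which is the kind of decision the post-processing camp says inline does not permit.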
Bottom line, you now have a bunch of files that have been squeezed by pulling out their bit patterns and stubbing them. Advocates argue that this saves space occupied by data on disk, defers CAPEX spending for new disk, and lets you store more files on the same platter. All are good things. They also say that data is not changed, it is only being recorded differently to disk, so there is no problem with governance or compliance. It is just a different way to write data. Makes sense.
Detractors, and these are primarily vendors dissing other vendors, say that product A changes data, while product B does not. These critics are summarily panned by their competitors, who claim that they are using Fear, Uncertainty and Doubt (FUD) as a marketing ploy.
Truth be told, all of these mechanisms are changing data. The question is whether this modality of change violates legal requirements for maintaining a full and unaltered copy of data.
Two large financials that I support have decided to exclude certain data from deduplication – citing SEC and SARBOX “full and unmodified” or “full and unaltered” language as their reason. Others, the preponderance of companies using the technology, which is the current darling of the industry, are throwing everything into their dedupe kit.
Question 1: how does anyone know whether dedupe violates the rules? Do we need to wait for a test case?
Question 2: how does the governance/compliance officer evaluate and weigh this risk?
Question 3: should there be a formal process in a company to evaluate the impact or potential impact of new technology on the corporate information governance position?
Question 4: technologies like dedupe are slipping in under the radar as features of venerable storage products from trusted vendors and may be escaping the notice of audit/compliance/governance. Plus they are new and untested. Plus they are not vetted by any standards or testing group in the industry. Plus most governance, compliance, and even records management folks don’t understand the technology to begin with (nor do many IT folk for that matter). Are we outsourcing our risk analysis to the vendor selling the gear and taking his word for its “safety” from the standpoint of risk? If so, can anyone name one other area of corporate operations where compliance decision making is outsourced to the vendor of the product you buy?
Folks, I think this one is right up your alley. You are the braintrust I read to get a handle on current best practices. What do you think is the solution here? Forget about the technobabble and focus on the core issue: how much data transformation is permissible or wise?
Now I will shut up. The issues are on the table.
Stay tuned for more.

July 17th, 2008 at 1:48 pm
Block level deduplication is at issue here. Every vendor has its own algorithm which it regards as secret sauce. There are no standards.
There are standards: standard interfaces on the front end, like NFS, CIFS and soon XAM. Claiming this is scary because the guts of the dedupe differ from vendor to vendor is a red herring; the guts of any storage system differ from vendor to vendor.
Look at hard drives — I could claim by the same token that there are no standards there! I can’t take the platters out of a Hitachi drive and put them in a Seagate drive. Each vendor has ECC codes and track formatting and magnetic domain shaping that they regard as secret sauce. That doesn’t make the disks any less reliable.
What matters here is the abstraction boundary provided by the user/application-facing interface, and if that maintains the integrity of the data, then it’s a reliable data storage solution. The bits you send to a hard drive never get written in that exact form on a disk — they get translated into an entirely different bit pattern by an ECC code. The bits that get written to tape with compression turned on aren’t the original source bits. And the bits on the drives in a deduplicating system are not the exact complement of bits sent to it. But, in all three cases, when you use the application interface you always, always, always get back exactly what you wrote in the first place.
And that’s what matters.
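The compression case above is easy to check with a few lines of Python. This is only a sketch: `zlib` stands in for whatever transformation a drive or appliance applies internally, and the sample record is made up.

```python
import hashlib
import zlib

# Made-up sample data; any byte string would do.
original = b"full and unaltered record " * 20

stored = zlib.compress(original)     # what actually lands on the medium
returned = zlib.decompress(stored)   # what the interface hands back

# The bytes on the medium are not the bytes that were sent...
on_media_differs = stored != original

# ...but the bytes that come back have the same fingerprint as the original.
round_trip_ok = (
    hashlib.sha256(returned).digest() == hashlib.sha256(original).digest()
)
```

Comparing a fingerprint taken before the write with one taken after the read is one way to test the claim at the interface without knowing anything about the internals.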