I know, I know. As a tech writer and blogger, I should be immune to the occasional pushback from readers who hold views that differ from my own. I usually have a pretty thick skin, but I really hate it when I take the time to respond to commenters only to have the comment facility into which I type my response fail when saving my work. That is exactly what happened when I spent 35 minutes writing a detailed response to the many comments on an article I had written in January on de-duplication and compression for TechTarget’s SearchStorage.
While one or two comments agreed with my perspective, several commenters disagreed vehemently or sought to add other perspectives that weren’t part of my coverage. Aside from an accusation of shoddy reportage from a commenter who referred to himself as TheStorageArchitect, most of the criticisms stayed on topic. I spent quite a bit of time crafting a response and, since the comment facility failed to post it, here is a synopsis of what I wrote.
First, a bit of background. The article was part of a series of tips on storage efficiency. I had argued in the series that storage efficiency came down to managing data and managing infrastructure so that we can achieve the twin goals of capacity allocation efficiency (allocating space to data and apps in a balanced and deliberate way that prevents a dreaded disk full error from taking down an app) and capacity utilization efficiency (allocating the right kind of storage to the right kind of data based on data’s business context, access/modification frequency, platform cost, etc.).
In this context, I argued that some vendor marketing materials and messages misrepresent the value of de-duplication and compression (technology that contributes on a short-term, tactical basis to capacity allocation efficiency) as a means to achieve capacity utilization efficiency. I wasn’t seeking to join the tribalistic nonsense out there claiming that vendor XYZ’s de-dupe kit is better than vendor ABC’s de-dupe kit. My key points were as follows.
- De-duplication remains a proprietary, rather than a standards-based, technology. That makes it a great value-add component that hardware vendors have used to jack up the price of otherwise commodity hardware. I cited the example of an early Data Domain rig with an MSRP of $410K for a box of roughly $3K worth of SATA drives, a price tag justified on the basis of a promised data reduction rate that was never realized by any user I have interviewed (a rough cost sketch follows this list). That, to my way of thinking, is one deficit of on-array de-duplicating storage appliances and VTLs. It is alleviated to some degree when de-dupe is sold as software that can be used on any gear, or better yet as an open, standards-based function of a file system, mainly because users then avoid proprietary vendor hardware lock-in. By the way, in response to one commenter: even if it is true that “all storage hardware companies are selling software,” I prefer as a rule to purchase storage software functionality in an intelligent way that makes it extensible to all hardware platforms rather than limiting it to a specific kit. That, to me, is what smart folks mean when they say “software-defined storage” today.
- De-duplication is not a long-term solution to the problem of unmanaged data growth. It is a technique for squeezing more junk into the junk drawer, which, even with all of that “trash compacting” value, will still fill up over time. From this perspective, it is a tactical, not a strategic, technology.
- The use of proprietary de-dupe technology mounted on array controllers limited, in many cases, the effect of de-duplication to data stored on trays of drives controlled by that controller. Once the box of drives with the de-duplicating controller was filled, you needed to deploy another box of drives with another de-duplicating controller that had to be managed separately. I think of this as the “isolated island of de-dupe storage” problem, and many of my clients have complained about it. Some commenters on the article correctly observed that some vendors, including NEC with its HydraStor platform, have scale-out capabilities in their hardware platforms. True enough, but unless I am mistaken, even vendors that allow the trays of drives to scale out under the auspices of their controller still require that all of the kit be purchased from them. Isn’t that still hardware lock-in? My good friend TheStorageArchitect said that I should have distinguished between active de-dupe and at-rest de-dupe, and he has a point. Had I done so, I might have suggested that if you were planning to use de-dupe for something like squashing many full backups with a lot of replicated content into a smaller amount of disk space, an after-write de-dupe process, which can be had for free with just about any backup software today, might be the way to go. But I would also have caveated that, if the VTL is meant to provide a platform for quick restores of individual files, a de-duplicated backup data set might not be the right way to go, since the data must be rehydrated on restore, introducing potential delays (see the de-dupe sketch after this list). The strategy of at-rest de-dupe in the VTL context also has me wondering why you wouldn’t use an alternative like LTFS tape or even incremental backups. As for in-line or active de-duplication, TheStorageArchitect stole my thunder with his correct assertion about the CPU demands of global de-duplication services. But I digress…
- My key point was that real capacity utilization efficiency is achieved not by tactical measures like data reduction, but by data management activities such as active and deep archiving and the like. Archives probably shouldn’t use proprietary data containers that require proprietary data access technologies in order to be opened and read at some future time. Such technologies just introduce another set of headaches for the archivist, requiring data to be un-ingested and re-ingested every time a vendor changes its data reduction technology. This may change, of course, if de-dupe becomes an open standard integrated into all file systems.
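To make the pricing point above concrete, here is the rough cost sketch I promised: a back-of-the-envelope look at how the effective cost per usable terabyte swings with the reduction rate actually achieved. The raw capacity and the ratios are hypothetical round numbers I am assuming purely for illustration; only the $410K figure comes from the article itself.

```python
# Purely hypothetical numbers for illustration: the raw capacity and the
# reduction ratios below are assumptions, not any vendor's actual specs.
appliance_price = 410_000      # USD, the MSRP cited above
raw_capacity_tb = 10           # TB of raw SATA capacity (assumed)
promised_ratio = 20            # the kind of reduction rate a data sheet promises (assumed)

for realized_ratio in (promised_ratio, 5, 2):
    usable_tb = raw_capacity_tb * realized_ratio
    print(f"at {realized_ratio}:1 reduction -> "
          f"${appliance_price / usable_tb:,.0f} per effective TB")
```

The arithmetic is trivial, but it shows why the promised ratio carries the whole justification: if the data sheet number isn’t realized in production, the cost per usable terabyte balloons several-fold.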
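And for readers wondering where the restore delays and CPU demands I mention come from, here is a minimal, purely illustrative sketch of hash-based block de-duplication. The fixed chunk size, the in-memory index, and the helper names are all my own simplifying assumptions; real products use variable-length chunking, persistent indexes, and far more sophisticated machinery.

```python
# Minimal sketch of hash-based block de-duplication (illustrative only).
import hashlib

CHUNK_SIZE = 4096          # bytes per chunk (assumed, fixed-size for simplicity)
store = {}                 # fingerprint -> chunk data (stands in for the array)

def dedupe_write(data: bytes) -> list[str]:
    """Split data into chunks, store only unseen chunks, return a chunk 'recipe'."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).hexdigest()   # fingerprinting cost paid on every write
        store.setdefault(fp, chunk)              # duplicate chunks stored only once
        recipe.append(fp)
    return recipe

def rehydrate(recipe: list[str]) -> bytes:
    """Restore must look up and reassemble every chunk in the recipe."""
    return b"".join(store[fp] for fp in recipe)

# Two "full backups" with mostly identical content share nearly all chunks.
backup1 = dedupe_write(b"A" * 20_000 + b"unique-1")
backup2 = dedupe_write(b"A" * 20_000 + b"unique-2")
assert rehydrate(backup1).endswith(b"unique-1")
print(f"chunks referenced: {len(backup1) + len(backup2)}, chunks stored: {len(store)}")
```

The point of the toy example: every write pays a fingerprinting and index-lookup cost (the CPU demand of in-line, global de-dupe), and every restore has to walk the recipe and reassemble chunks (the rehydration delay), which is why a de-duplicated backup set is not automatically the fastest place from which to restore a single file.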
I may have missed a few of the other points I made in my response to the comments on the TechTarget site, but I did want to clarify these points. To those who said that my claim that de-dupe’s star is fading was bogus, I can only offer what I am seeing. Many of my clients have abandoned de-duplication, either after failing to realize anything like the data reduction value touted by product vendors, or because of concerns about the legal and regulatory permissibility of de-duplicated data. While advocates are quick to dismiss the question of “material alteration” of data by de-dupe processing, no financial firm I have visited wants to be the test case. That you haven’t seen more reportage on these issues is partly a function of hardware vendor gag orders on consumers, prohibiting them, under threat of voided warranties, from talking publicly about the performance they get from the gear they buy.
If you like de-dupe, if it works for you and fits your needs, great! Who am I to tell you what to do? But if you are trying to get strategic about the problem of capacity demand growth, I would argue that data management provides a more strategic solution than simply putting your data into a trash compactor and placing the output into the same old junk drawer.
Two postscripts. First, I am informed that my rejected comment submittal has miraculously appeared in the appropriate section of the TechTarget site. No one understands what happened or how it resolved itself, but there it is.
Second, the original version of this post named the fellow I associated with the handle TheStorageArchitect: Chris Evans. (No, not Captain America. The other Chris Evans.) But Chris, whom I follow on Twitter, advised me that he did not make the post. So, I have redacted his name from the original post here. Apologies to Chris for the incorrect attribution.