Tuesday, March 11, 2008

Inconceivable!

I was at a customer's site last week and got into a discussion about "dedupe" (data deduplication). It came up while we were talking about reducing the amount of data stored. It was obvious to me that the marketing machines of the storage vendors are getting really good at this aspect of their job -- jumping on a buzzword and obscuring the original meaning of said word.

To wit: what is dedupe? First of all, it is an extremely hot phrase in the storage industry -- heck, one of the vendors actually has 'DDUP' as their ticker symbol. Second, in the movie 'The Princess Bride', the Spanish fencing master Inigo Montoya, played by Mandy Patinkin, in response to the short Sicilian criminal genius Vizzini (Wallace Shawn) repeating the word "Inconceivable!", replies "You keep using that word. I do not think it means what you think it means."

So, what does it mean?

According to Merriam-Webster, duplicate means: being the same as another. Deduplication, therefore, would be the removal of whatever is the same as another, leaving only what is unique. Which explains why this word can be used in the examples I'm about to give.

Single instance storage (SIS) traditionally means that duplicated files are kept singularly, with pointers used to reference the location of the one stored copy. The idea is this: say you created a 10MB document, then distributed it electronically to ten others on your LAN, each of whom stored it on a common server. That file would exist eleven times, taking up 110MB. But with SIS, as long as those files are identical (duplicates), only one copy would be stored, along with pointers to it for the other users. But what if each of the ten users made their own 2-byte change to the file, then saved it? Each file would be unique, eating up 110MB of disk, regardless of SIS. SIS is called by many vendors a 'dedupe' technology. And it is, but at a file level. The problem with this is that everything gained at a file level is lost with the tiniest change in the file. Are SIS vendors wrong when they call their product dedupe? Technically, no -- but let's look at this a bit deeper.
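The SIS arithmetic above can be sketched in a few lines of Python. This is only a toy model, not any vendor's implementation: the class name SingleInstanceStore, the use of SHA-256 to recognize identical files, and the zero-filled stand-in document are all assumptions made for illustration.

```python
import hashlib

MB = 1024 * 1024


class SingleInstanceStore:
    """Toy file-level dedupe: identical files share one stored copy."""

    def __init__(self):
        self.blobs = {}     # content digest -> file bytes, stored once
        self.pointers = {}  # owner -> digest of the file that owner saved

    def save(self, owner, data):
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)  # keep only the first copy
        self.pointers[owner] = digest        # everyone else gets a pointer

    def stored_mb(self):
        return sum(len(blob) for blob in self.blobs.values()) // MB


original = bytes(10 * MB)  # zero-filled stand-in for the 10MB document

# Case 1: the author and ten users all save the identical file.
identical = SingleInstanceStore()
identical.save("author", original)
for user in range(10):
    identical.save(f"user{user}", original)

# Case 2: each user changes a different 2 bytes before saving.
edited = SingleInstanceStore()
edited.save("author", original)
for user in range(10):
    copy = bytearray(original)
    copy[user * 2:user * 2 + 2] = b"\x01\x01"
    edited.save(f"user{user}", bytes(copy))

print(identical.stored_mb(), "MB stored for eleven identical files")  # 10
print(edited.stored_mb(), "MB stored after ten 2-byte edits")         # 110
```

Because the whole file is the unit of comparison, a 2-byte edit produces a different digest and therefore a full extra copy.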

In his seminal 2005 Technology Brief titled "Introduction to Capacity Optimization", Brad O’Neill states "… the first principle by which CO (Capacity optimization) technologies approach the storing of blocks of data. Because they maintain maximally granular plans for all objects, they only need one instance of any given object. Most importantly, CO technologies only increase their efficiency the more they are used for storing content. This is true because over time, a capacity optimizing system is introduced to new objects, breaking those objects down into parts while maintaining all new plans in an optimized store…" (For those who don't know him, Brad is a Senior Analyst and Consultant with Taneja Group Inc.) The key piece here is that he is talking about blocks. Sub-file deduplication is far more efficient than SIS, and when it is blended with compression it generally results in a reduction in size by a factor of 20.


How does that efficiency translate? Let's use the example above of the ten copies of a 10MB file being distributed to other users on the LAN. As we saw above, as long as none of the copies of the file is altered in the slightest, this potential 110MB of data will only take up 10MB using SIS. But the slightest change causes each file to be unique (no longer duplicates), hence stored separately, and the 10MB balloons back to 110MB. What happens with sub-file deduplication? In the same scenario, as long as all files are identical, this method, too, will only take up 10MB, except that when compression is added, we reduce that further to ~5MB. Next, each of the files is changed by two bytes -- what is the net effect? At a sub-file level, the methodology varies but the result is much the same: either byte-level deltas are compared and only the bytes that have changed are stored, then compressed, or, if block-level methodology is used, only the blocks on which the changed bytes reside are stored, then compressed. In either case, instead of 110MB, slightly more than 5MB is stored. This example shows, in gross terms, the key difference between SIS and sub-file dedupe.
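The block-level variant of this scenario can be sketched as follows. It is a rough illustration under stated assumptions, not a description of any product: the BlockStore name, the fixed 4KB block size, SHA-256 block fingerprints, and zlib compression are all choices made here for the example, and the byte-level delta approach is not shown. The test document is random hex text, picked only because it compresses at roughly 2:1; real data will behave differently.

```python
import hashlib
import os
import zlib

MB = 1024 * 1024
BLOCK_SIZE = 4096  # assumed fixed block size; real systems vary


class BlockStore:
    """Toy sub-file dedupe: unique blocks are stored once, compressed."""

    def __init__(self):
        self.blocks = {}  # block digest -> compressed block bytes
        self.files = {}   # file name -> ordered list of block digests

    def save(self, name, data):
        plan = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).digest()
            if digest not in self.blocks:  # store only blocks not seen before
                self.blocks[digest] = zlib.compress(block)
            plan.append(digest)
        self.files[name] = plan

    def load(self, name):
        # Rebuild the file by walking its list of block digests in order.
        return b"".join(zlib.decompress(self.blocks[d]) for d in self.files[name])

    def stored_bytes(self):
        return sum(len(block) for block in self.blocks.values())


# 10MB of random hex text: every block is distinct but compressible.
original = os.urandom(5 * MB).hex().encode()

store = BlockStore()
store.save("author", original)
for user in range(10):
    edited = bytearray(original)
    position = user * MB  # each user edits a different part of the file
    edited[position:position + 2] = b"ZZ"
    store.save(f"user{user}", bytes(edited))

logical = 11 * len(original)
print(f"logical data: {logical / MB:.0f} MB")
print(f"unique blocks kept: {len(store.blocks)}")
print(f"physical data: {store.stored_bytes() / MB:.2f} MB")
```

Eleven 10MB files are written, but only the original's blocks plus one changed block per user are kept, so the physical size stays close to that of a single compressed copy.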


So, is SIS dedupe? By definition, yes -- but in our little world of data storage, that would be inconceivable!

2 comments:

Stinky Pinky said...

I'll tell you what's "inconceivable" - it's the thought that anyone would actually believe any of this horse shit!

There are so many inaccuracies in these posts that you should be ashamed of yourself.

Do you actually work in the industry? I hope not.

Tom Mumford said...

One of my goals in allowing posts from readers is to challenge and/or correct perceived inaccuracies. I welcome your input so that we may all learn from your experience.