Prepare to zjoin ze gestalt...! Or GestaltIT tech field day preparation.
Some people commented on the win-win for both the people attending and the companies presenting, and I agree on both counts. But a post on the Ocarina Networks blog gave me a different view on what my own benefits might be.
Basically they posted a challenge for the attendees, stating the following:
[q]In anticipation of this event, we’re challenging the attendees to bring us their toughest data set on a thumb drive. It can involve whatever files they want Ocarina to try to shrink–JPEGs, video, audio, PDFs, homeshares, email databases and so on. They will probably want to include several similar but not duplicate files, such as a series of PowerPoint files that contain some of the same slides but also different ones, or similar slides that have been edited.
On November 13 when the participants arrive, we’ll collect all the thumb drives. Then we’ll pick a few at random from a hat, and do a real time demonstration of the Ocarina ECOsystem compressing and deduping the files.[/q]
Now, as you might have guessed from this quote, Ocarina Networks specializes in data deduplication. Basically, they look at the data you want to store, find matching patterns, and replace the duplicates with pointers to one valid copy of the data. The process itself is a little more complex than that, but this short version should give you the gist of it.
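As a rough sketch of that idea (assuming simple whole-chunk content hashing; Ocarina's actual engine is far more sophisticated than this), a deduper can hash each chunk of incoming data and store only the first copy of every hash it sees:

```python
import hashlib

def dedupe(chunks):
    """Store each unique chunk once; return the store plus per-chunk pointers."""
    store = {}      # hash -> the one stored copy of that data
    pointers = []   # one hash reference per original chunk
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)   # keep only the first copy seen
        pointers.append(digest)
    return store, pointers

# Toy data: four chunks, two of them duplicates of the first.
chunks = [b"slide A", b"slide B", b"slide A", b"slide A"]
store, pointers = dedupe(chunks)
print(f"{len(chunks)} chunks stored as {len(store)} unique copies")
# 4 chunks stored as 2 unique copies
```

Reading a chunk back is just a pointer lookup in `store`, which is why dedupe is transparent to whoever wrote the data.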
Now, how did this change my mind about the benefits this kind of challenge holds for the attendees?
Simple. By challenging you, they are asking you to bring a tough dataset. They want to prove that their technology works, no matter how hard the data is to dedupe. And that's where the catch is.
Since most of the people attending like a good challenge (or so it would appear to me) and most of them are not dedupe experts, they need to investigate! Sure, bringing a thumb drive with some data on it is no problem. But you probably want to make it challenging, so you need to find out what kind of data gives dedupe solutions a hard time. Look through reviews, blogs, tests and more, and you will find a lot of information on the various dedupe solution types (end-to-end, back-end, etc.).
And that brings me to the point of this post. I want to give the guys from Ocarina a run for their money, and I want to see what the product can do. So I searched for a hard dataset, and I also created a fairly normal one, both containing a mix of structured and unstructured data, and I am curious to see the results. Whatever the outcome, I learned a lot about dedupe, not just from this vendor but in general, and that is to me one of the biggest personal benefits I can see in this event.
So let's see how this goes, and expect a longer post on dedupe during or after the GestaltIT tech field days! Only a little over a week left until we kick it into a higher gear.
By the way, the first part of the title of this blog post was supposed to be slightly German sounding since "Gestalt" is an actual German word. Somehow I can't help but think of some sort of dodgy Monty Python sketch when talking about "the Gestalt".
Not exactly sure what would count as a dataset, but why not try two identical files (like a SQL dump), archive both in exactly the same way, but protect one of them with a password or encryption?
A dataset can basically be anything, depending on where you draw the definition. But say for example you have a PowerPoint file and mail it to three people. One of them makes a small change and mails it back to everyone.

In a normal environment you would now have a grand total of 8 files lying around: 2 versions with 4 identical copies each. Why would you want to store each individual file? Why not store the original once and the modified version once, and just put pointers or references to these stored files so that you don't consume so much disk space? Or go a step further: find which slides in the PowerPoint remained unchanged, use pointers for those, and only keep the changed slides of the modified version.
That is the basic concept of dedupe, and depending on what you look at (duplicate files, modified files, duplicate blocks on disk, duplicate bit patterns on disk, etc.), that is where your dataset is defined.
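The slide-level pointer idea from the mail example can be sketched in a few lines (a toy model with made-up slide contents, not how any real product stores files):

```python
import hashlib

# Hypothetical presentations: version 2 changes only one slide.
v1_slides = ["intro", "agenda", "results", "summary"]
v2_slides = ["intro", "agenda", "results-edited", "summary"]
files = [v1_slides] * 4 + [v2_slides] * 4   # the 8 copies from the mail example

slide_store = {}    # hash -> slide content, each unique slide stored once
file_index = []     # each file becomes just a list of pointers
for slides in files:
    pointers = []
    for slide in slides:
        digest = hashlib.sha256(slide.encode()).hexdigest()
        slide_store.setdefault(digest, slide)
        pointers.append(digest)
    file_index.append(pointers)

total = sum(len(f) for f in files)
print(f"{total} slides referenced, {len(slide_store)} actually stored")
# 32 slides referenced, 5 actually stored
```

Instead of 8 full presentations, only the 4 original slides plus the one edited slide take up space; everything else is pointers.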
Store a number of very similar Windows VMs on a NetApp that is running dedupe, and assume the storage gain is about 80% (so 20% individual bits, 80% shared bits). Now defragment the VMs inside the guest OS. How will this affect dedupe, the storage needed at the moment of deduplication, and the resulting similarity of the bits?
Bit-level dedupe is a different story. You have the problems you mentioned. Another example would be using encryption on a drive, which usually tends to kill any dedupe or thin provisioning efforts you have made.
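To illustrate why encryption hurts so badly, here is a toy demonstration. The cipher below is a hypothetical stand-in (a hash-based XOR keystream, not a real disk cipher), but it shares the property that matters: identical plaintexts encrypted under different keys produce ciphertexts with no blocks in common, so a block-level deduper finds nothing to share.

```python
import hashlib
import os

def blocks(data, size=16):
    """Split data into fixed-size chunks, as a block-level deduper might."""
    return {data[i:i + size] for i in range(0, len(data), size)}

def toy_encrypt(data, key):
    """Hypothetical stand-in for disk encryption: XOR with a key-derived
    keystream. Real ciphers differ, but their output also looks random
    and depends on the key."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

dump = b"identical database dump " * 100      # the same data, stored twice
enc_a = toy_encrypt(dump, os.urandom(16))     # each copy encrypted
enc_b = toy_encrypt(dump, os.urandom(16))     # with its own key

print(len(blocks(dump) & blocks(dump)))       # plaintext copies share all blocks
print(len(blocks(enc_a) & blocks(enc_b)))     # encrypted copies share none: 0
```

The same effect ruins thin provisioning: encrypted volumes look like fully random, fully written data.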
But I'll be sure to ask the question and report back with their comments!
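In the meantime, here is a toy illustration of why moving data around, as a defrag does, can hurt dedupe. This is a sketch assuming fixed-size block hashing; NetApp's actual 4K-block implementation may well behave differently. An identical clone adds no new unique blocks, but the same content shifted by a single byte misaligns every block boundary and suddenly looks new:

```python
import hashlib

def unique_blocks(data, size=8):
    """Count unique fixed-size blocks -- a crude stand-in for block dedupe."""
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    return len({hashlib.sha256(c).digest() for c in chunks})

base = bytes(range(8)) * 32          # 256 bytes of repeating content
vm_a = base
vm_b = base                          # an identical clone: dedupes fully
shifted = b"\x00" + base[:-1]        # same content moved one byte over,
                                     # the way a defrag might relocate it

print(unique_blocks(vm_a + vm_b))    # clone adds no new blocks: 1
print(unique_blocks(vm_a + shifted)) # misalignment creates new blocks: 3
```

A dedupe engine that works on variable-length, content-defined chunks is less sensitive to this, which is exactly the kind of difference worth asking the vendor about.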