Windows Server 2012 (R2) Deduplication and you
by Nathan.Fouarge, on May 26, 2015 8:36:32 AM
When someone mentions deduplication with regards to storage, many people get blinded by dollar signs. Very few of the people I talk with know that they may already have a perfectly good deduplication solution sitting in a server they have already provisioned. Starting with Windows Server 2012, and further enhanced in Windows Server 2012 R2, Microsoft included the option to enable data deduplication on NTFS data volumes. Many people never even saw this feature mentioned when Server 2012 came out, since there was so much else to be excited about in that release. It is coming up on 3 years since Server 2012 was released and I still talk with people who have not heard of this feature, so I wanted to shed a little light on what it can do for you as a backup target.
What is data deduplication anyway?
Before I get too deep into what you can do with deduplication in Server 2012 (R2), and how you can use it as storage for your backups, I want to briefly explain what data deduplication does for you and the ways it can be implemented.
One of the first distinctions that comes up with software deduplication is whether it is inline or post-process.
- Inline data deduplication means the data is deduplicated before it is stored in its final location; this is normally software based. The machine sending the data usually takes the brunt of the CPU/memory cost of deduplicating inline, so this method normally means the transfer to the destination is slower.
- Post-process data deduplication sends all of the data to the location where it will be stored first; after the data is written, its blocks are compared against what is already stored and the duplicate blocks are discarded.
Another implementation choice is where the data is actually deduplicated: at the source or at the target.
- With source deduplication, the data is deduplicated on the same file system it originates from.
- Target-based deduplication occurs at the location the data is being copied to; a typical use case is backup storage, like the setup described further down.
Finally, I want to talk about how the data is actually compared for deduplication; the common methods are file, block, and chunk (variable block) based.
- File-based deduplication works on whole files, much like Microsoft Exchange's single instance storage: if I have two Excel spreadsheets that are exactly the same, they are deduplicated and take up the storage of only one document. But as soon as one of those spreadsheets changes, the deduplication is broken and the space savings are gone. File-based deduplication is not very expensive resource-wise, but in most cases it also does not give you a whole lot of savings.
- Block-based deduplication uses a fixed block size. This is normally found on hardware storage arrays such as SANs and other higher-end dedicated devices. Fixed-block deduplication is normally very fast and gives a pretty good deduplication rate. There are drawbacks, though: if one byte of, say, a 1024 KB block changes, that entire block is no longer deduplicated and you lose the storage savings for the rest of that block. To reduce that wasted savings the block size is normally kept small, but smaller blocks take more processing time (see the sketch after this list).
- Finally, chunk-based or variable block size deduplication is the last common method for comparing data. It works like the fixed-block method except the block size varies, with the algorithm choosing whatever boundary it determines will waste the least space at that moment. Variable block size deduplication is most often used by software-based deduplication algorithms and file-system-level deduplication. For example, NovaBACKUP xSP's FastBIT 3 algorithm uses variable blocks to reduce the amount of data sent to the Storage Server for offsite backup.
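To make the fixed-block idea concrete, here is a minimal PowerShell sketch of block-level comparison. It is only an illustration, not the algorithm Windows or any particular product uses, and the C:\Temp\sample.bin path is just a placeholder: it hashes a file in 4 KB blocks and counts how many blocks repeat. Change a single byte inside one block and that block's hash, and its savings, are gone.

```powershell
# Illustration only: hash each 4 KB block of a file and count how many blocks
# are exact duplicates of a block already seen (i.e. would not need storing again).
$path       = 'C:\Temp\sample.bin'   # hypothetical sample file
$blockSize  = 4KB
$sha256     = [System.Security.Cryptography.SHA256]::Create()
$seen       = @{}
$duplicates = 0

$stream = [System.IO.File]::OpenRead($path)
try {
    $buffer = New-Object byte[] $blockSize
    while (($read = $stream.Read($buffer, 0, $blockSize)) -gt 0) {
        # Hash only the bytes actually read so the final short block is handled too
        $hash = [System.BitConverter]::ToString($sha256.ComputeHash($buffer, 0, $read))
        if ($seen.ContainsKey($hash)) { $duplicates++ } else { $seen[$hash] = $true }
    }
}
finally {
    $stream.Dispose()
    $sha256.Dispose()
}

'{0} of {1} blocks are duplicates' -f $duplicates, ($duplicates + $seen.Count)
```

Variable block (chunk) deduplication makes the same kind of comparison, but lets the block boundaries shift, so an insert near the start of a file does not invalidate every block after it.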
Deduplication with Windows Server 2012 (R2)
Now that you have a little background on the different methods of data deduplication, it helps to know what Microsoft implemented in Server 2012 (R2). Microsoft's data deduplication is post-process, source based, and chunk/variable block based. That means the data has to reside at full size on the machine doing the deduplication before it is reduced, and the variable chunking method gets you the best results. Because deduplication happens after the data has arrived, it adds no CPU/memory cost to the transfer itself, but it also means you need enough space to hold all of the data in its original form. That last part is particularly important when you use a Server 2012 (R2) machine as the storage target for backups: you need to make sure the backup target has enough free space to hold the data from your backups before it is deduplicated.
Setup in Server 2012 (R2)
Installation and configuration of the data deduplication feature for Server 2012 (R2) has been covered by many different people, and there are plenty of resources out there for it, so I will not go over it step by step. Instead I will point you to one of the articles that I think covers pretty much everything you need to do: Redmondmag
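For reference, the PowerShell side of that walkthrough is only a few commands. This is a minimal sketch assuming your backup volume is E:; adjust the drive letter for your environment.

```powershell
# Install the Data Deduplication role service (part of File and Storage Services)
Install-WindowsFeature -Name FS-Data-Deduplication

# Enable deduplication on the backup volume (E: is assumed here)
Import-Module Deduplication
Enable-DedupVolume -Volume 'E:'

# Verify the volume now shows up as dedup-enabled
Get-DedupVolume -Volume 'E:'
```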
Once you have the data deduplication feature installed and configured, there are a couple of configuration changes I suggest if you are using that storage as a backup target. First, set 'Deduplicate files older than (in days)' to 0 so that optimization jobs grab any and all files when they run. Speaking of optimization jobs: since this storage is used for backups and most of your data transfer will happen during the night, you will want to change the default schedules. I normally set up two jobs, one for the early morning when my backups should be done and another that finishes right before my nightly backups kick off. Make sure to schedule both optimization and garbage collection jobs. Garbage collection and scrubbing I normally schedule just once a week, but depending on your data size and how often things are deleted on the volume you may want to change that.
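Here is a sketch of those settings in PowerShell, again assuming the backup volume is E:. The schedule names and times are placeholders for "after backups finish" and "ending before backups start", so adjust them to your own backup window.

```powershell
# Deduplicate files regardless of how recently they were written
Set-DedupVolume -Volume 'E:' -MinimumFileAgeDays 0

# Two daily optimization windows: one after the nightly backups finish,
# one that wraps up right before the next night's backups kick off
$allDays = 'Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday'
New-DedupSchedule -Name 'PostBackupOptimization' -Type Optimization -Days $allDays -Start '06:00' -DurationHours 6
New-DedupSchedule -Name 'PreBackupOptimization'  -Type Optimization -Days $allDays -Start '13:00' -DurationHours 6

# Garbage collection and scrubbing once a week; run them more often if a lot
# of data gets deleted from the volume between runs
New-DedupSchedule -Name 'WeeklyGarbageCollection' -Type GarbageCollection -Days Saturday -Start '08:00' -DurationHours 5
New-DedupSchedule -Name 'WeeklyScrubbing'         -Type Scrubbing         -Days Saturday -Start '13:00' -DurationHours 5
```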
Setup in NovaStor DataCenter
If you are using NovaStor DataCenter as the backup software, make sure to configure the disk pool for dedup optimization. In my testing, the 1024 Byte alignment has produced the best deduplication rates, so I would suggest starting with that setting in NovaStor DataCenter. With my production setup I see deduplication savings of 40TB on a storage pool that is itself only 8TB; not bad at all for something bundled into an OS you may already be running.
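If you want to see how your own volume compares, the dedup cmdlets report the savings directly. A quick check, again assuming the backup volume is E: (property names can vary slightly between OS versions):

```powershell
# How much space deduplication is saving on the backup volume
Get-DedupVolume -Volume 'E:' | Select-Object Volume, SavedSpace, SavingsRate

# File counts and the last optimization/garbage collection/scrubbing runs
Get-DedupStatus -Volume 'E:'
```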
If you haven't already, sign up to receive information about the technology behind NovaStor DataCenter, NovaStor's technology partners, Webinar invitations, and general network backup and restore knowledge.
More information about NovaStor DataCenter here.